THEORY OF STATISTICS 

FOR 

B. A., B. Sc., HONOURS AND OTHER CLASSES 


BY 

J. N. SHARMA, M. A. 

Associate Professor of Mathematics 
Meerut College, Meerut 
& 

J. K. GOYAL, M. Sc. 

Department of Mathematics 
Christ Church College, Kanpur 


All rights reserved with the Authors





KRISHNA PRAKASHAN MANDIR 

Publishers, Meerut (U. P.)

First Edition MEERUT (U. P.)

Price Rs. I PC(» 




Published by

B. D. RASTOGI, M. A., B. T.

Proprietor 

KRISHNA PRAKASHAN MANDIR 

Meerut. 



OUR MATHEMATICS PUBLICATIONS 

for 

Post-Graduate & B. A. (Honours) Students

Co-ordinate Solid Geometry (Dr. S. C. Mittal M. A., Ph. D.) 
Integral Calculus (Dr. S. C. Mittal & J. N. Sharma) 
Differential Calculus (Dharma Vira & J. N. Sharma) 

Partial Differential Equations & Spherical Harmonics 

(Dharma Vira & Dr. S. C. Mittal) 
Differential Equations (Dr. S. C. Mittal & J. N. Sharma) 
Statics (Profs. Brahmanand, B. S. Tyagi & B. D. Sharma) 
Dynamics (Profs. Brahmanand, B. S. Tyagi & B. D. Sharma)
Theory of Equations (Dr. S. C. Mittal & J. N. Sharma)
Algebra (Infinite Series) (J. N. Sharma) 

Vector Analysis (Dr. S. C. Mittal) 

Hydro-Statics (Dr. R. K. Gupta)

Dynamics of Rigid Bodies (Dr. R. K. Gupta & J. N. Sharma) 
Complex Variables (Dr. R. K. Gupta & J. N. Sharma) 

Theory of Functions of Real Variable 

(Dharma Vira & Dr. S. C. Mittal) 
Determinants (J. N. Sharma) 

Coordinate Solid Geometry II (Differential Geometry)

(Dr. S. C. Mittal)

Attraction & Potential (Dr. S. C. Mittal & T. Singh) 

Sequence and Limits and Uniform Convergence of Series and 

Power Series (J. N. Sharma) 
Theory of Aggregates (Dr. S. C. Mittal & Dharma Vira) 

Theory of Statistics (J. N. Sharma and J. K. Goyal) 



Printed by 
I. D. GUPTA

At the 

PRAKASH PRINTING PRESS, 

Meerut. 



PREFACE 


This book “Theory of Statistics’’ has been written for the 
use of the students preparing for M. A., M. Sc. (Maths.), 
B. A., B. Sc. (Statistics) and honours examinations of Indian 
Universities. The book will also be found useful for various 
competitive examinations. 

In a subject as extensive as that of Statistics, it is 
inevitable that no single volume, however bulky it may be, can 
adequately cover the whole range of topics. In this book an 
attempt has been made to cover the courses prescribed for the 
above examinations. Students are also advised to consult
various text-books on the subject by eminent authors as well. 

The book has been divided into a number of chapters 
and each chapter deals with a separate topic. Great care has 
been taken to explain the fundamental principles in an exhaustive 
manner. We have tried to treat the subject matter in a clear and 
lucid style. Obscure points which are a source of confusion 
to the students have been thoroughly explained and numerous 
illustrative examples have been added in each chapter. Examples 
have been selected from question papers of various examinations 
as well as from standard text-books. Most of these examples 
have been taken from I. A. S. and P. C. S. papers. The chapter 
on probability which is of fundamental importance in statistical 
theory has been dealt with in some detail. The authors express 
their manifold indebtedness to all the writers whose books they 
have consulted. 

We are grateful to Shri N. Saran, Head of the Mathematics
Department, Christ Church College, Kanpur, for his help
in solving some of the problems and to Shri M. P. Tripathi, 
a student of M. Sc. (Final), for some valuable suggestions. 

The authors will feel rewarded if the book is found useful by
those whom it is intended to serve. Suggestions for the improvement
of the book will be gratefully received.

In the end, we owe a debt of gratitude to the publishers
and printers for the utmost efficiency with which they have
brought out this book. In particular, we are indebted to
the staff and management of Prakash Printing Press, Meerut, for
their meticulous care and keen sense of printing aesthetics in bringing
out this rather unusual book in its present form.

Meerut 

September, 1963.


AUTHORS 



CONTENTS 


Chapter   Subject

I.        Meaning and Purpose of Statistics
II.       Frequency Distribution and Measures of Central Tendency
III.      Measures of Dispersion and Skewness
IV.       Consistence of Data and Association of Attributes
V.        Finite Differences and Interpolation
VI.       Probability
VII.      Continuous Frequency Distributions
VIII.     Important Theoretical Distributions
IX.       Moment Generating Function and Cumulants
X.        Method of Least Squares and Curve Fitting
XI.       Bivariate Distribution, Regression and Correlation
XII.      Multiple and Partial Correlations
XIII.     Preliminary Ideas on Sampling
XIV.      Simple Sampling of Attributes : Large Samples
XV.       The Sampling of Variables : Large Samples
XVI.      χ²-Distribution
XVII.     The Sampling of Variables : Small Samples
XVIII.    Analysis of Variance
XIX.      Index Numbers
          Tables



CHAPTER I 


MEANING AND PURPOSE OF STATISTICS 

1.1. Origin of Statistics. The term ‘statistics’ appears to
have been derived from the Latin status or the Italian statista,
both of which mean a political state. In the mid-eighteenth
century, the word ‘statistics’ was used to describe “the political
arrangement of the modern states of the known world”. It was
in Germany that the systematic collection of official statistics
began towards the end of the eighteenth century.

In England, statistics came into being during the Napoleonic
wars. It became necessary to begin a systematic collection of
numerical data so that the government might be able to assess
the revenues and expenditures with more precision and to raise
new taxes in order to meet the cost of the war. Thus the origin
of what is known as descriptive statistics lay in the functions of a
state, military or otherwise. Nowadays this word
has acquired a wider meaning and embraces all branches of
human activity concerned with numerical data.

The origin of what is these days known as the theory of statistics
lay in games of chance. In the mid-seventeenth century,
the gambler Chevalier de Mere proposed the famous “problem
of points” to Blaise Pascal. The problem may be described as
follows : Two persons play a game of chance. The person
who first gains a certain number of points wins the stake. They
stop playing before the game is completed. How is the stake to
be divided on the basis of the number of points each has won ?
This interesting problem engaged the attention of the two astute
French mathematicians, Pascal and Fermat, who had a lengthy
correspondence between them before the problem was solved.
The methods used by Pascal laid the foundations of the theory of
probability on which the modern theory of statistics is based.
Remarkable progress was made in the theory of statistics under
the influence of Galton and Karl Pearson (1857-1936).
Ronald A. Fisher has made notable contributions in modern
times to both theoretical and applied statistics. A great deal of
research work has been done and is being done in the field of
the mathematical theory of statistics.

1.2. Definition of Statistics. The term statistics has been
defined in various ways. To a layman it simply means a
mass of figures or a collection of data. It is true that statistics
deals with aggregates of figures, but not all numerical
data are statistics, and before we attempt a formal definition
of statistics it is necessary to give some explanation. In the physical
sciences, the experimental method is used to find the effect of a
particular cause by keeping all other causes fixed and varying
that particular cause. This is made possible by the fact that the data
are completely under the control of the experimenter. This
advantage being denied to an observer in the social sciences, he has
to depend upon statistical methods, which deal with highly
complicated cases of multiple causation: cases in which a given
result may be due to any one of a number of alternative causes
or to a number of different causes acting conjointly. For example,
the stature of a man is causally connected with his ancestry, his
race, his diet, his occupation and the place and climate in which
he lives. Thus we may define statistics as a collection of objects
affected to a marked extent by a multiplicity of causes. It is also the
science of collecting, analysing and interpreting numerical
data. In this sense, statistics is a field of study, a doctrine
concerned with mathematical characterizations of aggregates of
items. Secrist defines statistics as “aggregate of facts, affected to a
marked extent by a multiplicity of causes, numerically expressed,
enumerated or estimated according to reasonable standards of
accuracy, collected in a systematic manner for a predetermined
purpose and placed in relation to each other.”

The word ‘statistics’ is also used in the sense of the plural of
statistic, which is employed to denote a constant of a sample like
the average, standard deviation, coefficient of correlation etc., which
we shall discuss later on.

1.3. Scope and limitations of statistics. In modern times
statistical methods are so widely used that it is difficult to
enumerate the various spheres of their application. There is
no department of human activity where statistics does not creep
in. Whatever your attitude may be towards this subject,
statistics insistently intrudes into your daily life. Its scope
is stretched over all those branches of human knowledge in which
a grasp of the significance of large numbers is looked for.

Today we live in a period when fast changes, social, 
economic or political, are taking place. The economic activities 
are being more and more closely directed to the production of 
industrial goods and our future is being planned to a large extent. 
In order that this planning may be successful, it must be based 
soundly on the correct analysis of complex statistical data. 
Most large industrial and commercial enterprises now employ 
research workers trained in the application of statistical methods. 
Trained statisticians are employed by government in various departments
of its administration. Thus statistical methods are used
in psychology, education, public health and administration, 
agriculture, business, economics and many other spheres of 
human activity. 

Statistics has however certain limitations. In the first place, 
statistics deals with aggregates of objects and does not take 
cognisance of individual items. For example, in finding the 
statures of students of a class, a statistician is not very much 
interested in the heights of individual students but in their average 
height. It is immaterial for him whether a particular student has 
a height of five feet or seven feet. He wants to have a bird’s eye 
view, so to say, of their heights, perhaps by way of comparing it 
with the average height of students of the same class of some 
other college or of a different class of the same institution. 

Secondly, unlike laws of physical sciences, statistical laws
are not exact. The results of a statistical enquiry are not expressed
in the form of a categorical certainty but in terms of
probabilities only. Statisticians, you will find, only very rarely
claim to prove anything. In his training the statistician is taught
to examine the reliability of his data and the justification of his
conclusions with the utmost suspicion. Statistical laws are true
on an average and in the long run. Thus in a coin tossing experiment,
we may expect heads and tails fifty-fifty in a large
number of trials provided the coin is unbiased. In other words,
the probability of getting a head in a single throw is 1/2. This does
not mean that if we toss a coin 100 times, we shall get fifty heads
and fifty tails. It is possible that the number of heads is seventy.
It only means that as the number of trials gets larger and
larger, the proportion of heads may be expected to come nearer
and nearer to one half.

In the third place, statistical
methods are applicable only to the study of those facts that are
quantitatively measurable. Qualitative expressions must first be
reduced to precise quantitative terms. For example, to compare
the culture of two countries, we should know the number of
industries, educational institutions, hospitals, law courts etc.
of both the countries. The last and the most important limitation
is that statistical data must be handled by experts. Statistical
methods are the most dangerous tools in the hands of the
inexpert. Statistics is one of those sciences whose adepts must
exercise the self-restraint of an artist.
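The remark on coin tossing made under the second limitation can be checked by direct calculation. The following is a minimal sketch in Python, assuming an unbiased coin and independent throws; the function name `prob_heads_between` and the band of 0.05 on either side of one half are our own choices for illustration and are not part of the text.

```python
from math import comb

def prob_heads_between(n, lo, hi):
    """Exact probability that the number of heads in n throws of an
    unbiased coin lies between lo and hi inclusive."""
    favourable = sum(comb(n, k) for k in range(lo, hi + 1))
    return favourable / 2 ** n

# Exactly fifty heads in 100 throws is far from certain ...
p_exactly_50 = prob_heads_between(100, 50, 50)

# ... and seventy or more heads, though rare, is possible.
p_70_or_more = prob_heads_between(100, 70, 100)

# Probability that the proportion of heads lies within 0.05 of one half
# (band width is an illustrative assumption), for growing numbers of throws.
p_near_half = {
    n: prob_heads_between(n, 45 * n // 100, 55 * n // 100)
    for n in (100, 1000, 10000)
}

for n, p in p_near_half.items():
    print(f"n = {n:>5}: P(0.45 <= proportion of heads <= 0.55) = {p:.4f}")
```

The probability of remaining close to one half rises towards 1 as the number of throws increases, which is the sense in which the law holds "in the long run".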

1.4. Population and sample. A population may be defined
as the totality of all actual or conceivable objects under consideration.
To speak more accurately, a population consists of
numerical values connected with these objects. The term universe
is also used for population. Thus we may speak of the population of
human beings of India, the population of statures of men of certain
age groups in a locality, the population of lengths of life of electric
bulbs, the population of pressures at various points in the atmosphere,
the population of throws of a coin and so on. A population may be
finite or infinite, existent or hypothetical.

A sample is defined as a selected number of individuals each 
of which is a member of the population. It is clear that a sample 
will not tell us everything about the parent population from which 
the sample has been drawn, but an attempt may be made to 
estimate certain constants of the parent population. 

There are various methods of estimating a constant (mean, 
standard deviation, moments etc.) from the data of a sample. 
When we have obtained an estimate by one of these methods, 
we then examine the reliability of that estimate and the degree of
confidence which can be placed in it. This is usually done in 
terms of probabilities. We shall discuss the theory of sampling in 
some detail in the succeeding sections of the book. 

1.5. Main stages of a statistical enquiry. A statistical enquiry
passes through various stages which we mention in brief as
follows :

Collection of data. The first task of a statistician is to collect
and assemble his data. He may prepare the data himself or
borrow them from government, semi-government or non-official
records. We may call these two types of data primary and
secondary. For the purpose of collecting data, trained and expert
investigators must be appointed so that the data obtained are
reliable. It is essential to examine the reliability of data
before any conclusions can be based on them. It will be a mere
waste of time to apply the refined theoretical methods of statistics
to data which are suspect from the beginning.

Classification and tabulation of data. When the data are 
obtained, we have to arrange and condense them into some 
suitable form. This is very necessary since the mind cannot grasp 
the significance of a large mass of figures. It is true that some 
information about the individual items will be sacrificed in the 
process of condensation. But, as we have already remarked, this 
individual information is of little interest to a statistician. For 
example, suppose a statistician has got one thousand figures giving 
the incomes of the families of a certain locality to the nearest 
rupee, ranging from rupees one hundred to rupees five hundred. This
haphazard collection of one thousand figures has to be condensed
before the mind can grasp the significance of the important facts
contained therein. In this case, we may group together all those
families whose incomes lie in a certain range, say of twenty-five
rupees. Our total range of four hundred rupees is then divided
into sixteen sub-ranges, each of twenty-five rupees, and we may
summarise the data by giving the numbers of families who fall
into each of the sixteen sub-ranges. In other words, the data have been
tabulated to a suitable scale. 
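The arithmetic of this condensation may be sketched as follows. The limits 100 and 500 and the width 25 are those quoted above; the handful of incomes is hypothetical, and placing an income of exactly Rs. 500 in the last sub-range is our own assumption, since the text does not say how that value is to be treated.

```python
LOWEST, HIGHEST, WIDTH = 100, 500, 25        # figures quoted in the text

n_classes = (HIGHEST - LOWEST) // WIDTH      # number of sub-ranges

# (lower limit, upper limit) of each sub-range
classes = [(LOWEST + i * WIDTH, LOWEST + (i + 1) * WIDTH)
           for i in range(n_classes)]

def class_index(income):
    """Index of the sub-range containing `income`.
    Assumption: each sub-range includes its lower limit, and the top
    value of Rs. 500 is put into the last sub-range."""
    return min((income - LOWEST) // WIDTH, n_classes - 1)

# Hypothetical incomes, for illustration only (not data from the text).
incomes = [100, 124, 125, 310, 499, 500]

counts = [0] * n_classes
for income in incomes:
    counts[class_index(income)] += 1

for (lower, upper), count in zip(classes, counts):
    print(f"{lower}-{upper}: {count}")
```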

Analysis of data. The first task of a statistician is over when
he has arranged and tabulated his data. In some cases, the
enquiry may end at this stage. For example, if a statistician is
preparing index numbers for the use of an economist, he may
hand over the numbers to that person without making any further
comment. In other cases, he may have to pursue his investigation
further. He may proceed to the analysis and elucidation of the
causes which gave rise to the data. There are various methods
for this purpose. The most important of them is the investigation
of relationships between phenomena, which leads to the theory of
dependence, contingency and correlation, and to the finding of various
coefficients in order to examine how far one set of events depends
upon another.

Interpretation of data. When the data have been analysed, 
we have to interpret them in order to draw inferences from them. 



1.6. Distrust of statistics. To a layman statistics is nothing
but a huge mass of figures which ‘can prove anything’. This
belief of his is based on the impressive figures which are put
before him time and again by politicians through speeches and
by advertising agencies through newspapers, periodicals and the
radio, in support of their views and claims. Sometimes
such figures are justifiably used to form a basis for the
arguments which are built upon them, but generally they are
meant to mislead. The layman is fully conscious of this fact
and it is no wonder that he distrusts all arguments based on figures.
He is to be excused for this belief because he has not the training
of an expert statistician who can distinguish between right and
wrong applications of statistics. Thus it is not the science of
statistics which is to be blamed but the persons who handle the
statistical data. As a matter of fact, statistics hardly tries to
prove anything. The aim of statistics is to assist in the orderly
arrangement of the data to which it is applied and to give added
precision to any conclusion that may be inferred. Statistical
methods are only tools for handling numerical data and
are to be used by expert statisticians. In the hands of the inexpert,
they become most dangerous. Those who say that
statistics are lies know nothing of the subject of statistics. However,
the misuse of statistics can be well understood from the famous
quotation : “There are said to be three comparisons in lying: lies,
damn lies and statistics.” It may be noted that the fault does not
lie with the subject itself but with those who twist and misuse the
statistical facts, either due to ignorance or to achieve their
motives. Some examples of such misrepresentations are given
below :

(a) A man went to a doctor and enquired whether he could
survive a certain surgical operation. The doctor told him that
he would certainly survive since, according to statistics, 20 per cent
of the operations were successful and four of his patients had already
died of that operation, and so he would constitute the
20 per cent who were to survive.

(b) A man was crossing a river with his family. He calculated
that the average depth of the river was 3½' while the average
height of his family members was above 4'. He accordingly thought
that it was quite safe to cross the river on foot, the result being
that the whole family was drowned.



(c) A statistician collected the figures of deaths in road
accidents and found that the number of persons killed while walking
on the footpath was 100, while the number killed while walking in the
middle of the road was 5. He argued that it is safer to walk in the
middle of the road than on the footpath. (The mistake he committed was that
he did not care to find the ratios of the persons dying to the total
number in the two cases.)
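The correction indicated in example (c) can be made concrete with a short sketch. The death figures 100 and 5 are those of the example; the total numbers of walkers are purely hypothetical, since the example gives none, and are chosen only to show how the comparison may be reversed once ratios are taken.

```python
from fractions import Fraction

# Deaths as stated in example (c).
deaths = {"footpath": 100, "middle of road": 5}

# HYPOTHETICAL totals of persons walking in each place; the text gives
# no such figures, these are assumed purely for illustration.
walkers = {"footpath": 100_000, "middle of road": 500}

# The proper comparison is the ratio of persons dying to the total number.
death_rate = {place: Fraction(deaths[place], walkers[place])
              for place in deaths}

for place, rate in death_rate.items():
    print(f"{place}: {deaths[place]} deaths, rate = {rate} ({float(rate):.3%})")
```

Under these assumed totals the footpath shows twenty times as many deaths but only one tenth of the risk per walker.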

Statistical figures used out of the context in which they have
been collected, and with suppression of some material information,
may lead to wrong results. We often come across contradictory
arguments advanced on the same facts, each supported by
statistical figures, as it is customary to use only those parts of the
information which are advantageous and to ignore the rest, however
important they may be.

EXERCISES 

1. Write an essay on the nature and scope of statistics. 

(Agra 1955) 

2. Explain the advantages as well as the drawbacks of the study
of social problems through numbers. (Agra 1951)

3. Draw a brief sketch of your idea of the modern statistical
methods. (Agra 1950)

4. Discuss the meaning and scope of statistics, and show how it 
can be applied to social and physical sciences. (Agra 1948) 

5. In what ways can statistical methods be misused by interested 
persons ? Give examples. 

6. ‘Statistics only furnish a tool, necessary though imperfect,
which is dangerous in the hands of those who do not know its
use and deficiencies’ (Bowley). Comment on the above
statement.

7. What is statistically wrong in the following statements :

(a) In a city there were 300 women drivers and 1500 men
drivers in motor car accidents in a particular year.
Women are therefore safer drivers than men.

(b) The number of students in M. A. and M. Sc. classes
of Mathematics in Meerut College has increased almost
five times during the past ten years. The importance of
the subject has therefore increased during this period.

(c) The players A and B play bridge, and the player A wins
six out of eight games. The player A is therefore superior
to B.



CHAPTER II 


FREQUENCY DISTRIBUTION AND MEASURES 

OF CENTRAL TENDENCY

2.1. What is the importance of classification and tabulation
in statistics ? Mention the considerations in deciding the class
limits and class intervals. (Agra M. Sc. 1947)

Show how the limits of classes are defined in grouping a given
data. (I. A. S. 1957)

Consider the rent of forty houses given below :— 

5, 6, 7, 8, 11, 15, 20, 8, 11, 25, 30, 15, 17, 11, 6, 22, 25, 20,
22, 15, 30, 32, 32, 8, 20, 25, 22, 22, 35, 37, 40, 20, 11, 25, 20, 10,
10, 15, 35, 42.

On a perusal of the above data, it is not convenient to form 
an idea of the prevailing rent in the locality and to know whether 
generally the rents are on the lower side say below Rs. 15/- or on 
the higher side say above Rs. 25/-. Now consider the data arranged 
in the following manner :— 

Table I

Rent    No. of Houses      Rent    No. of Houses
  5           1             22           4
  6           2             25           4
  7           1             30           2
  8           3             32           2
 10           2             35           2
 11           4             37           1
 15           4             40           1
 17           1             42           1
 20           5

The above table is known as the frequency table; the rents
are the variates, and the numbers of houses in the second and
fourth columns are the frequencies. The sum of the frequencies
is 40, the total number of houses considered. Next consider
the data grouped as under :



Table II

Rent in Rs.          No. of houses
 5-10 (under 10)           7
10-15                      6
15-20                      5
20-25                      9
25-30                      4
30-35                      4
35-40                      3
40-45                      2

Here the houses are divided into 8 groups according to their
rents. This grouping gives a clearer picture than the first or second
form. The rent has been divided into classes with a difference of
Rs. 5 in each class, which is known as the class interval. In this case
the class interval for all the classes is uniform, but this is not necessary
in every grouping. In fact in some cases, such as the
incomes of the residents of a locality where the number of higher
income people is small, we sometimes have open intervals, say
Rs. 1000 and above. The value of the variate in a grouped
frequency table is taken to be the middle value or the central value
of the class. Thus in the first class, it is Rs. 7.5. The number of
houses in the second column against a certain class is the class
frequency of that class and is denoted by f, while the mid-value of
a class is denoted by x.
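The passage from the raw figures to Table I and then to Table II may be sketched in Python as follows. This is only an illustration under stated assumptions: the list `rents` holds the forty rents as tallied in Table I, each class is taken to include its lower limit but not its upper limit (as in Table II), and the names `table_one`, `class_of` and `grouped` are our own.

```python
from collections import Counter

# The forty rents, as tallied in Table I.
rents = [5, 6, 7, 8, 11, 15, 20, 8, 11, 25, 30, 15, 17, 11, 6, 22, 25, 20,
         22, 15, 30, 32, 32, 8, 20, 25, 22, 22, 35, 37, 40, 20, 11, 25, 20, 10,
         10, 15, 35, 42]

# Table I: frequency of each distinct value of the variate.
table_one = dict(sorted(Counter(rents).items()))

LOWEST, WIDTH = 5, 5      # lowest class limit and class interval of Table II

def class_of(rent):
    """(lower, upper) limits of the class containing `rent`; the lower
    limit is included in the class and the upper limit is excluded."""
    lower = LOWEST + (rent - LOWEST) // WIDTH * WIDTH
    return (lower, lower + WIDTH)

# Table II: class, mid-value x and class frequency f.
class_counts = Counter(class_of(r) for r in rents)
grouped = [((lower, upper), (lower + upper) / 2, class_counts[(lower, upper)])
           for lower, upper in sorted(class_counts)]

for (lower, upper), x, f in grouped:
    print(f"{lower:>2}-{upper:<2}  x = {x:>4}  f = {f}")
```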

The following points should be taken into consideration 
while determining the class intervals :— 

(i) So far as possible class intervals should be equal. 

(ii) The number of classes should be between 10 and 20. 
A frequency table with less than 10 classes may not yield good 
results while with more than 20 classes, the computation and 
calculations are apt to become tedious. 

(iii) The class intervals should be clearly defined. It
should be clear whether the class a to b includes the upper
limit or the lower limit. In the above table, the class Rs. 5-10
includes Rs. 5/- but not Rs. 10/-, as explained in the bracket, and
this rule holds for the whole table. It is much better to give
a note at the foot of the table to explain in which class
the limit values have been included.



2.2. Illustrate graphically the distinction between a frequency
polygon, a histogram and an Ogive and comment on their uses.

(B. Sc. Luck. 43, P. C. S. 43, Patna 42)

Describe the advantages of the graphical representations of a
frequency distribution and explain the various types of graphs.

We have seen that grouped frequency distribution gives a 
better idea of the data than an ungrouped one. However if the 
distribution is represented by graphs, a clearer visual conception 
of the distribution is obtained. The various graphs are as 
follows :— 

(i) Histogram, (ii) Frequency polygon , (iii) Frequency curve , 
(iv) Cumulative frequency curve or the Ogive. 

Histogram. Consider Table II of the previous question.
Take the line OX as the x-axis representing the variates and a
perpendicular line OY to represent the frequency. Mark equal
intervals of 5 each on the x-axis and denote these points by
$a_1, a_2, a_3, a_4, \ldots$

The frequency for the interval 5-10 is 7; hence erect a rectangle
with base $a_1 a_2$ and height proportional to 7; similarly for
the interval 10-15 erect a rectangle with height proportional to 6,
and so on. The figure so formed is called a Histogram, the area
of a rectangle representing the total frequency in that class. In this
case it is the number of houses whose rent falls in that class.

(B. A. Hons. Delhi 51)

In a frequency polygon, the value of the variate is taken to be
concentrated at the mid-value of the interval. The middle or
the central values of the classes are marked along the x-axis and
the ordinates are taken proportional to the frequencies. These
points are joined by straight lines and a frequency polygon
$b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8$ is obtained. (M. Sc. Agra 59)

The defect of the histogram and the frequency polygon lies in the
fact that the value of the variate is taken to be concentrated at
the mid-value, which is not always the case.
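The construction of the histogram and of the frequency polygon described above can be sketched numerically. The class limits and frequencies are those of Table II; the rough rendering with rows of `#` is our own device and merely stands in for the drawn figure.

```python
# Class limits and class frequencies of Table II.
class_limits = [5, 10, 15, 20, 25, 30, 35, 40, 45]
frequencies = [7, 6, 5, 9, 4, 4, 3, 2]

# Histogram: one rectangle (left limit, right limit, height) per class.
rectangles = list(zip(class_limits[:-1], class_limits[1:], frequencies))

# Frequency polygon: vertices at (mid-value of the class, frequency).
vertices = [((left + right) / 2, height) for left, right, height in rectangles]

# With equal class intervals the areas are proportional to the frequencies.
total_area = sum((right - left) * height for left, right, height in rectangles)

# A rough text picture of the histogram, one row per class.
for left, right, height in rectangles:
    print(f"{left:>2}-{right:<2} | " + "#" * height)
```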

Frequency curve. If in a grouped distribution the class
intervals become very small while the number of observations
increases, the histogram or the frequency polygon tends to a
smooth continuous curve, known as the frequency curve. The distribution
of the heights of persons is a continuous one and
gives a frequency curve. (B. A. Punjab 53)

Cumulative frequency curve or Ogive. We may be more interested
in the frequency with which a variate takes values more than or less than
a given value than in the frequency at that value of the variate.
In the example of the previous question, we may want to know
the number of houses with rent less than Rs. 20/- or more than
Rs. 20/-. In this case in place of the frequency we have the Cumulative
frequency, and the curve so obtained on the graph is known as the Cumulative
frequency curve or Ogive. Thus in place of Table II, we shall
have


Rent in Rs.          Frequency
Less than 10             7
Less than 15            13
Less than 20            18
Less than 25            27
Less than 30            31
Less than 35            35
Less than 40            38
Less than 45            40

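Similarly, if the table is to be formed for ‘more than’, we have

Rent in Rs.                    Frequency
Equal to or more than  5          40
Equal to or more than 10          33
Equal to or more than 15          27
Equal to or more than 20          22
Equal to or more than 25          13
Equal to or more than 30           9
Equal to or more than 35           5
Equal to or more than 40           2
Equal to or more than 45           0

The Cumulative frequency curve or Ogive for the above data is

[Figure: ‘less than’ and ‘more than’ Ogives for the above data]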



It will be seen that the x-co-ordinate in the first curve is the
upper limit and in the second the lower limit, and not the mid-value
of the intervals. The cumulative frequency table is useful
in finding the median and quartiles of the distribution.
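The two cumulative tables above can be obtained from Table II by running totals. A minimal sketch, using the class limits and frequencies of Table II; the names `less_than` and `more_than` are our own.

```python
from itertools import accumulate

# Class limits and class frequencies of Table II.
class_limits = [5, 10, 15, 20, 25, 30, 35, 40, 45]
frequencies = [7, 6, 5, 9, 4, 4, 3, 2]
total = sum(frequencies)

# 'Less than' table: (upper limit of class, cumulative frequency).
less_than = list(zip(class_limits[1:], accumulate(frequencies)))

# 'Equal to or more than' table: (lower limit, frequency at or above it).
more_than = [(class_limits[0], total)] + [
    (limit, total - cumulative) for limit, cumulative in less_than
]

for limit, cumulative in less_than:
    print(f"Less than {limit:>2}: {cumulative}")
for limit, remaining in more_than:
    print(f"Equal to or more than {limit:>2}: {remaining}")
```

At every class limit the two cumulative frequencies add up to the total frequency 40, which is a convenient check on the tables.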

2.3. Describe the various forms of frequency curves.

(i) Symmetrical unimodal curve. This has a point of maximum in the
middle and the curve is symmetrical on both sides of the ordinate at
this point. It shall be seen later on that in a symmetrical
distribution, the mean, median and mode coincide.

(ii) Moderately asymmetrical (skew) distribution. In this case the
frequency falls more rapidly on one side of the maximum than
on the other. This is the most common distribution.

(iii) Extremely skew or J-shaped distribution. In these distributions
the maximum frequency is at one end of the distribution.
They do not exhibit symmetry about any ordinate. Such distributions
are found in the case of bank balances and other distributions
of income. (B. A. Punjab 1953)

(iv) U-shaped or antimodal curves. These distributions have
a minimum frequency with rising frequencies on either side of the
minimum. They may or may not be symmetrical about this point.
Such distributions occur in the number of patients classified according
to age group, as children and aged persons are more likely to fall
ill than young persons.

Example 1. The distribution of a herd of cows classified
according to the quantity of milk produced by each cow per week is
symmetrical. The distribution of the same herd classified according
to the amount of butter-fat produced by each cow per week is negatively
skew towards the lower quantities. Suggest a possible
explanation for this fact.

The distribution of cows classified according to the yield of
milk per cow per week is symmetrical, i.e. cows yielding more and
cows yielding less milk than a certain average quantity, by equal
amounts, are equally frequent (see fig. 1). Had the proportion or
percentage of the fat-content been constant in the milk of all cows,
the distribution should again have been symmetrical. But it is not so;
on the other hand it is negatively skew towards the lower quantities
as shown in fig. 2.

[Fig. 1] [Fig. 2]

The figure suggests that the abscissae denoting the yield of butter-fat
in the milk of cows yielding more than the average quantity
are proportionately more diminished than the abscissae
denoting the yield of butter-fat in the milk of cows yielding less
than the average. This shows that the milk of the cows which
produce more than the average quantity is poorer in fat-value
than the milk of cows yielding less than the average.

This may possibly be due to an attempt to increase the
yield of milk by some artificial methods without proportionately
increasing its fat-value.

Example 2. A number of perfectly spherical balls, all of the
same material, give a symmetrical distribution classified according to
their diameters. Show that, if they are classified according to their
weight, their frequency distribution will be positively skew towards
the higher values.

Let $d$ be the diameter corresponding to the maximum
frequency $f$ and $e$ the magnitude of each class interval. Let $d-e$
and $d+e$ be the two values of diameters on either side of the
central value, each having a frequency $f'$, since the distribution
is symmetrical. Now when the balls are distributed according to
their weights, the weights corresponding to $d-e$, $d$ and $d+e$ will
be proportional to $(d-e)^3$, $d^3$, $(d+e)^3$, having the same frequencies
$f'$, $f$ and $f'$. But here the intervals corresponding to the equal
frequencies $f'$ are

$$d^3-(d-e)^3 = 3d^2e-3de^2+e^3 \quad\text{and}\quad (d+e)^3-d^3 = 3d^2e+3de^2+e^3,$$

which shows that the second interval is greater than the first by $6de^2$.

It follows that the frequency $f'$ to the right of the maximum
frequency will be distributed in a larger interval than the frequency
$f'$ to the left of the maximum, i.e. the spread of the frequency
curve to the right will be greater than on the left. Hence the
distribution will be positively skew towards the higher values.
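
The argument of Example 2 can be checked numerically. The sketch below is
only an illustration: the five diameters and their frequencies are made-up
values, and the third central moment divided by the 3/2 power of the second
is used as a rough indicator of skewness, a measure which this section does
not itself define.

```python
from statistics import mean

def skewness(values):
    """Third central moment over the 3/2 power of the second (assumed indicator)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical symmetrical distribution of diameters: (diameter, frequency).
table = [(8, 1), (9, 4), (10, 6), (11, 4), (12, 1)]
diameters = [d for d, f in table for _ in range(f)]
weights = [d ** 3 for d in diameters]  # weight is proportional to diameter cubed

# Widths of the two weight intervals on either side of the central value.
d, e = 10, 1
left_interval = d ** 3 - (d - e) ** 3
right_interval = (d + e) ** 3 - d ** 3

print(skewness(diameters), skewness(weights))
print(right_interval - left_interval, 6 * d * e ** 2)
```

The indicator should vanish for the diameters and be positive for the
weights, and the two intervals should differ by 6de^2.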

2.4. Measures of Central Tendency.

Principal Features of Frequency Distributions. Although a histogram
or frequency polygon gives a general idea of the distribution,
yet it is necessary for comparison of similar distributions to have
some arithmetical description of the distribution.
There are certain features which give a general idea of the 




distribution and which can be determined arithmetically. These 
are the measures of central tendency or averages, measures of 
dispersion or variation, skewness, and the peakedness. 

Averages. An average may be termed as the value of the
variable which represents the distribution and is thus known as a
measure of central tendency. We have a vague idea of
averages in our daily life. Thus when we say that people in
Punjab are more prosperous than in Bihar, it does not mean 
that each and every person in Punjab is richer than each person 
in Bihar but all that we mean is that on the average, Punjabis 
are more well to do than Biharis. Since the average is a value 
of the variable, it has the same units as the variable. Thus 
the average height will be in units of height, the average of the 
percentages shall be a percentage. There are different averages 
in use, the common being the Arithmetic Mean, Median, Mode, 
Geometric Mean and the Harmonic Mean. 


2.5. Arithmetic Mean.


Establish the formula for finding the arithmetic mean from an
assumed mean. (Agra M. Sc. 1952)

If the variable $x$ takes the $n$ values $x_1, x_2, \ldots, x_n$, the arithmetic
mean

$$\bar{x} = \frac{x_1+x_2+\cdots+x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i. \qquad \ldots(1)$$

If in the distribution, the variables $x_1, x_2, \ldots, x_n$ occur with
respective frequencies $f_1, f_2, f_3, \ldots, f_n$, the A. M. is

$$\bar{x} = \frac{f_1x_1+f_2x_2+\cdots+f_nx_n}{f_1+f_2+\cdots+f_n} = \frac{\sum f_ix_i}{\sum f_i}. \qquad \ldots(2)$$

The formula (2) gives the weighted arithmetic mean, since the
value of the variable $x_i$ is weighted by the frequency $f_i$.

If the data have been grouped into classes, so that there is 

more than one value of the variable in a class, the middle value 

of the class interval is taken as the value of variable in that 

class. This gives a fairly good approximate value of the mean 

if the range of x is much greater than the width of the class 
interval. 


The arithmetic mean may be regarded as the x-co-ordinate of
the centre of gravity of a plane figure of the form of the histogram 
of the distribution. 




If instead of the origin, the values of the variate are measured
from some assumed mean, say $a$, then

$$\frac{1}{N}\sum f_i(x_i-a) = \frac{1}{N}\sum f_ix_i-a = \bar{x}-a, \qquad \ldots(3)$$

where $N=\sum f_i$.

Hence

$$\bar{x} = a+\frac{1}{N}\sum f_i(x_i-a) = a+\frac{1}{N}\sum f_i\xi_i,$$

where $\xi_i$ is the deviation of $x_i$ from the assumed origin $a$.

When $a=\bar{x}$, the R. H. S. of (3) vanishes and hence we have

The sum of the deviations of the values from their mean is always
zero. ...(5)

In case of a grouped frequency table having equal class
intervals, the calculation of the mean may be done in a more
convenient way by the substitution

$$u_i = \frac{x_i-a}{h},$$

where $h$ is the width of the class interval,
or $x_i = a+u_ih$.

Now

$$\bar{x} = \frac{1}{N}\sum f_ix_i = \frac{1}{N}\sum f_i(a+u_ih) = a+h\cdot\frac{1}{N}\sum f_iu_i = a+h\bar{u}.$$

The above formula reduces the labour of calculations compared
with the previous formula and is known as the Step Deviation
Formula.
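
As a rough check of the assumed-mean and step-deviation forms, the
following sketch computes the mean of a grouped table in three ways. The
class mid-values, the frequencies and the function names are hypothetical
and chosen only for illustration.

```python
def mean_direct(x, f):
    """Weighted mean: sum(f*x) / sum(f)."""
    return sum(fi * xi for xi, fi in zip(x, f)) / sum(f)

def mean_assumed(x, f, a):
    """Assumed-mean form: a + (1/N) * sum(f * (x - a))."""
    return a + sum(fi * (xi - a) for xi, fi in zip(x, f)) / sum(f)

def mean_step(x, f, a, h):
    """Step-deviation form: a + h * u_bar, where u = (x - a) / h."""
    u_bar = sum(fi * (xi - a) / h for xi, fi in zip(x, f)) / sum(f)
    return a + h * u_bar

# Hypothetical mid-values of classes of width 10 and their frequencies.
mid_values = [5, 15, 25, 35, 45]
frequencies = [3, 7, 12, 6, 2]

print(mean_direct(mid_values, frequencies))
print(mean_assumed(mid_values, frequencies, a=25))
print(mean_step(mid_values, frequencies, a=25, h=10))
```

All three calls should agree, since the choice of a and h changes only the
arithmetic and not the result.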

2.6. If there are $r$ series of observations $X_1, X_2, \ldots, X_r$, the
mean $M$ of the whole series is related to the means $M_1, M_2, \ldots, M_r$
of the component series by the equation

$$NM = N_1M_1+N_2M_2+\cdots+N_rM_r.$$

(Agra B. Sc. 1955, I. A. S. 1955)

Let $x_{1j}$, $j=1, 2, 3, \ldots, N_1$, denote the values of the variables
in the first series. Then

$$M_1 = \frac{1}{N_1}\sum_{j=1}^{N_1} x_{1j}$$

or $M_1N_1 = \sum_{j=1}^{N_1} x_{1j}$ = sum of the values of the variable in
the first series.

Similarly if $x_{2j}$ denote the values in the second series,
$j=1, 2, \ldots, N_2$, we have

$$M_2N_2 = \sum_{j=1}^{N_2} x_{2j},$$

and so on, so that

$$M_rN_r = \sum_{j=1}^{N_r} x_{rj}.$$

We get

$$M_1N_1+M_2N_2+\cdots+M_rN_r = \sum_{j=1}^{N_1} x_{1j}+\sum_{j=1}^{N_2} x_{2j}+\cdots+\sum_{j=1}^{N_r} x_{rj}.$$

The right side of this equation is the sum of the values in the
whole series which from above is $NM$, where $N=\sum_{i=1}^{r} N_i$.

Hence $NM = N_1M_1+N_2M_2+\cdots+N_rM_r$. ...(6)

The above proof can be applied to a grouped distribution
by multiplying the values of the variables with respective
frequencies.
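
The relation NM = N_1 M_1 + ... + N_r M_r can be tried out on a small case.
In the sketch below the three component series are invented for
illustration, and `combined_mean` is a hypothetical helper name.

```python
def combined_mean(sizes, means):
    """Mean of the whole series from the sizes and means of its components."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# Three hypothetical component series.
series = [[2, 4, 6], [10, 20], [1, 1, 1, 5]]
sizes = [len(s) for s in series]
means = [sum(s) / len(s) for s in series]

# Mean of the pooled values, computed directly for comparison.
pooled = [v for s in series for v in s]
direct = sum(pooled) / len(pooled)

print(combined_mean(sizes, means), direct)
```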


2.7. Median is the value of the middle variable when the
variables are arranged in increasing or decreasing order. If the
number of the variables is odd, say $2n+1$, the value of the $(n+1)$th
variable gives the median value, while if it is even, say $2n$,
then the mean of the $n$th and $(n+1)$th values gives the median.
Thus to get the median height of a class of boys, they should be
made to stand in a line according to their heights and the height
of the boy in the middle position of the line would give the
median. In the case of a continuous variable, the median may
be defined as the value of the variable, the ordinate through
which divides the area of the curve into two equal parts. In the





case of grouped frequency distribution, the median by simple
interpolation is given by

$$M_d = L+\frac{\frac{N+1}{2}-F}{f}\times i, \qquad \ldots(7)$$

where $M_d$ is the median,

$N$ is the total frequency,

$L$ is the lower limit of the median class,

$f$ is the frequency of the median class,

$F$ is the total of all the frequencies before the median class,

$i$ is the class interval of the median class.
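
The interpolation can be sketched in a few lines. The code below assumes
classes of equal width and takes the middle item as the (N+1)/2 th, as the
solved examples of § 2.14 do; the function name is hypothetical, and the
table is that of solved example 6.

```python
def grouped_median(lower_limits, width, freqs):
    """Median of a grouped table by simple interpolation in the median class."""
    n = sum(freqs)
    rank = (n + 1) / 2          # position of the middle item
    cumulative = 0              # F: total frequency before the current class
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= rank:
            return lower + (rank - cumulative) / f * width
        cumulative += f
    raise ValueError("empty distribution")

# Classes 0-5, 5-10, ..., 35-40 of solved example 6.
lower_limits = [0, 5, 10, 15, 20, 25, 30, 35]
freqs = [2, 5, 7, 13, 21, 16, 8, 3]

print(round(grouped_median(lower_limits, 5, freqs), 2))
```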

2.8. Mode is the value of that variable which has the maximum
frequency. It is the value of the variable at the maximum ordinate
of the ideal curve which gives the closest possible fit to the actual
distribution. When we say that ‘Rajputs are brave’, we talk of a
mode, i.e. bravery is the most frequent feature among the ‘Rajputs’.
In a frequency curve, it is the abscissa of the point where
$\frac{dy}{dx}=0$, $\frac{d^2y}{dx^2}<0$. For a grouped
distribution, the mode is given by the formula

$$\text{Mode} = L+\frac{f-f_{-1}}{2f-f_{-1}-f_1}\times i, \qquad \ldots(8)$$

where $L$ is the lower limit of the modal class, $i$ the class interval
of the modal class, $f_1$ and $f_{-1}$ the frequencies of the class following
and preceding the modal class respectively, and $f$ the frequency
of the modal class.
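
A minimal sketch of the grouped-mode formula follows. It assumes classes of
equal width and, as a convention of this sketch only, treats a missing
neighbouring class as having frequency zero; the function name is
hypothetical and the table is that of solved example 6.

```python
def grouped_mode(lower_limits, width, freqs):
    """Mode of a grouped table from the modal class and its two neighbours."""
    k = max(range(len(freqs)), key=freqs.__getitem__)    # index of modal class
    f = freqs[k]
    f_prev = freqs[k - 1] if k > 0 else 0                # assumed 0 at the ends
    f_next = freqs[k + 1] if k + 1 < len(freqs) else 0
    return lower_limits[k] + (f - f_prev) / (2 * f - f_prev - f_next) * width

# Classes 0-5, 5-10, ..., 35-40 of solved example 6.
lower_limits = [0, 5, 10, 15, 20, 25, 30, 35]
freqs = [2, 5, 7, 13, 21, 16, 8, 3]

print(round(grouped_mode(lower_limits, 5, freqs), 2))
```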

Describe geometric mean, harmonic mean, quartiles and 
deciles. 

2.9. Geometric Mean of a series $x_1, x_2, \ldots, x_n$ is given by

$$G = (x_1x_2\cdots x_n)^{1/n}. \qquad \ldots(9)$$

For a frequency distribution,

$$G = (x_1^{f_1}x_2^{f_2}\cdots x_n^{f_n})^{1/N}, \qquad \ldots(10)$$

where $f_1+f_2+\cdots+f_n = N$.

Taking logarithms,

$$\log G = \frac{1}{N}\sum_{i=1}^{n} f_i\log x_i.$$

2.10. Harmonic Mean is the reciprocal of the arithmetic mean
of the reciprocals of the variates.

$$\frac{1}{H} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{x_i}. \qquad \ldots(11)$$

In case of frequency distributions,

$$\frac{1}{H} = \frac{1}{N}\sum_{i=1}^{n}\frac{f_i}{x_i}. \qquad \ldots(12)$$
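
The frequency forms of the geometric and harmonic means translate directly
into code. The sketch below uses a small invented table; the function names
are hypothetical, and the geometric mean is computed through logarithms as
in § 2.9.

```python
import math

def geometric_mean_freq(x, f):
    """G from log G = (1/N) * sum(f * log x); all x must be positive."""
    n = sum(f)
    return math.exp(sum(fi * math.log(xi) for xi, fi in zip(x, f)) / n)

def harmonic_mean_freq(x, f):
    """H from 1/H = (1/N) * sum(f / x); all x must be non-zero."""
    n = sum(f)
    return n / sum(fi / xi for xi, fi in zip(x, f))

# Hypothetical values and frequencies.
values = [1, 2, 4, 8]
freqs = [1, 2, 2, 1]

am = sum(fi * xi for xi, fi in zip(values, freqs)) / sum(freqs)
gm = geometric_mean_freq(values, freqs)
hm = harmonic_mean_freq(values, freqs)
print(hm, gm, am)
```

For positive values which are not all equal, the three results should come
out in the order H < G < A. M.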




2.11. Quartiles. If the variates are arranged in ascending
or descending order, the quartiles are the values of those
variables corresponding to the $\frac{1}{4}$th and $\frac{3}{4}$th variables. The lower
quartile $Q_1$ divides the variables between the lowest and the median
into two equal parts and similarly the upper quartile $Q_3$ divides the
variables between the median and the greatest into two equal
parts. Similarly if the variables are divided into ten equal parts,
the values of the variates at these points give deciles and if divided
into hundred equal parts, the values at these points are known as
percentiles. For grouped distributions the general formula is

$$Z_r = l_0+\frac{Nr-F_0}{f_0}\times i, \qquad \ldots(13)$$

where $Z_r$ denotes the fractile of the $r$th order, $l_0$ is the lower limit
of the fractile class, $N$ the total frequency, $F_0$ the total of all the
frequencies before the fractile class, $f_0$ the frequency of the fractile
class and $i$ the class interval of the fractile class.

The above formula gives the median if $r=\frac{1}{2}$, quartiles if $r=\frac{1}{4}$
and $\frac{3}{4}$, deciles if $r=\frac{1}{10}, \frac{2}{10}, \ldots$ and percentiles if $r=\frac{1}{100}, \frac{2}{100}, \ldots$
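
The general fractile formula can be sketched as a single function. The
version below assumes classes of equal width. The flag `plus_one` is an
addition of this sketch: it places the fractile at the (N+1)r th item, as the
solved examples of § 2.14 do, instead of the Nr th. The data are the age
distribution of exercise 4, with "age last birthday 15-19" read as the class
15 to 20, which is also an assumption.

```python
def fractile(lower_limits, width, freqs, r, plus_one=False):
    """Fractile of order r (0 < r < 1) of a grouped table by interpolation."""
    n = sum(freqs)
    rank = r * (n + 1 if plus_one else n)
    cumulative = 0              # total frequency before the current class
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= rank:
            return lower + (rank - cumulative) / f * width
        cumulative += f
    raise ValueError("r must lie between 0 and 1")

lower_limits = [15, 20, 25, 30, 35, 40]
freqs = [4, 20, 38, 24, 10, 4]

median = fractile(lower_limits, 5, freqs, 0.5)
q1 = fractile(lower_limits, 5, freqs, 0.25, plus_one=True)
q3 = fractile(lower_limits, 5, freqs, 0.75, plus_one=True)
print(median, q1, q3)
```

Deciles and percentiles follow by passing r = 0.1, 0.2, ... or
r = 0.01, 0.02, ... to the same function.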

2.12. What are the desiderata for a satisfactory average ? Point
out the special characteristics of A. M., G. M. and the mode.

(Agra M. Sc. ’53, ’56; Agra B. Sc. ’59, ’60) 

In what circumstances would you consider the arithmetic mean, 
the geometric mean and harmonic mean, the most suitable statistics 
to describe the central tendency of the distribution ? 

(I. A. S. ’54, Agra M. Sc. ’63) 

The following are the properties of an ideal average :— 

(i) It should be rigidly defined and its value should be definite 
which can be found mathematically. 

(ii) It should take into consideration all values of the variates. 

(iii) It should be possible to calculate with reasonable ease and 
rapidity. The calculations should not be too lengthy and tedious. 

(iv) It should be least affected by fluctuations of sampling, i.e.
if two independent samples from the same universe are taken, 
their average should not differ considerably. 

(v) It should be possible to treat it algebraically. Thus 




knowing the averages of the component series, the average of the 
whole series should be expressible in terms of the averages of the 
component series. 

Relative merits and demerits of different averages. 

(1) Arithmetic Mean. 

Advantages : 

(i) It is well defined and based on all observations.

(ii) It can be easily calculated. 

(iii) It gives weight to all items in direct proportion to their 
sizes. 

(iv) In most cases it is not affected by fluctuations of sampling. 

(v) It lends itself easily to algebraic treatment. 

Disadvantages : 

(i) It sometimes gives values which may not physically be
possible, e.g. average number of births in a locality as 8.7 per day.

(ii) It gives undue weight to extreme values, e.g. a high
salaried man may drag up the average salary of the staff in an
establishment.

(iii) It cannot be calculated in cases where the extreme ends
are open, e.g. the distributions of income with the end value ‘above
Rs. 1000’ or ‘below Rs. 50’.

(2) Median. 

Advantages : 

(i) Like the mean, it is easily calculated, well defined and 
based upon all observations. 

(ii) It does not give undue weight to very high or very low
values of the variable.

(iii) It is possible to calculate the median even in cases where 
the end intervals are open. 

(iv) It can be found out in cases where the values of the 
variable cannot be found definitely e.g. beauty, intelligence. 

Disadvantages : 

(i) It does not lend itself to algebraic treatment, i.e. the
median of the sum or difference of pairs of corresponding observations
in two series is not the sum or difference of the medians of
the two series.

(ii) It may sometimes be indefinite, when the number of
items in the modal class is large.

(iii) Median may sometimes be located at the point where 
the frequency may be quite small, thus it may not represent the 
typical feature of the distribution. 




(3) Mode. 

Advantages : 

(i) It is easily calculated. 

(ii) It can be found out from the graph by a mere look on it. 

(iii) It can be found out in distributions where the end values 
are open. 

(iv) Mode is very easily comprehensible. Thus the statement 
‘Indians are religious-minded’ means that most of the Indians are 
religious-minded or the religious-minded people in India form the 
modal class. 

Disadvantages :

(i) Its algebraic treatment is difficult except in continuous 
distributions. 

(ii) It is not well defined. 

(iii) Its value may not be well defined. 

(4) Geometric Mean. 

Advantages. 

(i) It gives weight to each and every item. 

(ii) It is amenable to algebraic treatment. 

(iii) It is suitable in cases where ratios are given, e. g. index 
number of prices and in quantities whose changes tend to be 
directly proportional to the quantity itself e. g. populations. 

Disadvantages : 

(i) It is difficult to calculate. 

(ii) It vanishes if a single value of the variable is zero, and 
may become imaginary for negative values of the variables. 

(iii) It cannot be found out in cases of open intervals. 


(5) Harmonic Mean.


Advantages. 

(i) Based on all values. 

(ii) Can be treated algebraically. 

(iii) Harmonic mean is useful in cases like finding the
average speed when the speed for different parts of the distance
is given in distance per unit time. Thus if a train moves 5 miles 
with speed 2i m. p. h , 10 miles with speed 30 m. p. h. and 20 
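miles with speed 10 m. p. h., the average speed is the total
distance, 5 + 10 + 20 miles, divided by the total time, i.e. by the
sum of the three distances each divided by its own speed. This is
the Harmonic Mean of the speeds, weighted by the distances.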





Its disadvantage is that it is difficult to calculate.

It may be noted that Arithmetic Mean is the most convenient 
and widely used form of the average and should always be used 
unless there are strong reasons in some particular cases against it. 

2.13. Empirical Relation between Mean, Median and Mode.

For moderately asymmetrical distributions, the relation

Mode = Mean - 3 (Mean - Median)
approximately holds. For a symmetrical distribution, the mean,
median and mode coincide.

2.14. Solved Examples.

1. Show that for a J-shaped distribution with the maximum
frequency towards the lower values of the variate, the median is
nearer to $Q_1$ than $Q_3$. (M. Sc. Agra ’54)

It is easier to understand the problem with the help of a
frequency curve. Since the area between $Q_1$ and the median is
equal to the area between the median and $Q_3$, each being equal to
$\frac{1}{4}$th of the area of the whole curve, and since in this case the
ordinates of the curve decrease rapidly with increasing values of
the variate, it is clear that the distance between the abscissae
of the median and $Q_1$ is smaller than between $Q_3$ and the median.
(See the figure.) The same argument can be extended to cases of
discrete distributions with the help of the histogram.

2. If there are two variables $u$ and $v$ in which the value $u_i$
corresponds to the value $v_i$ for each $i$, and a new variable $Z=au+bv$
is formed, show that

$$\bar{Z} = a\bar{u}+b\bar{v}.$$

Let the number of values of each of $u$ and $v$ be $n$.

Now $Z_i = au_i+bv_i$,

so that $\sum_{i=1}^{n} Z_i = a\sum_{i=1}^{n} u_i+b\sum_{i=1}^{n} v_i$.

Dividing by $n$ on both sides, we get

$$\bar{Z} = a\bar{u}+b\bar{v}.$$

Similarly if $Z = a_1u_1+a_2u_2+a_3u_3+\cdots+a_mu_m$, we get

$$\bar{Z} = a_1\bar{u}_1+a_2\bar{u}_2+a_3\bar{u}_3+\cdots+a_m\bar{u}_m.$$



3. Show that the weighted arithmetic mean of first $n$ natural
numbers whose weights are equal to the corresponding numbers is
equal to $\frac{2n+1}{3}$.

$$\bar{x} = \frac{1\cdot 1+2\cdot 2+3\cdot 3+\cdots+n\cdot n}{1+2+3+\cdots+n} = \frac{\sum n^2}{\sum n} = \frac{n(n+1)(2n+1)}{6}\Big/\frac{n(n+1)}{2} = \frac{2n+1}{3}.$$
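
The result of example 3 is easy to confirm for particular values of n. The
sketch below uses exact fractions so that no rounding enters; the function
name is hypothetical.

```python
from fractions import Fraction

def weighted_mean_first_n(n):
    """Weighted mean of 1, 2, ..., n, each number weighted by itself."""
    numbers = range(1, n + 1)
    return Fraction(sum(k * k for k in numbers), sum(numbers))

for n in (1, 2, 5, 10):
    print(n, weighted_mean_first_n(n), Fraction(2 * n + 1, 3))
```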


4. The frequency distribution below gives the cost of production
of sugarcane in different holdings. Obtain the arithmetic mean.

Cost of        No. of        Cost of        No. of
Production     Holdings      Production     Holdings
  2-6             1           18-             52
  6-              9           22-             36
 10-             21           26-             19
 14-             47           30-34            3

(Audit and Accounts 41)

Taking the middle value of the classes as the values in the
classes,

m. v. (x)     f       m. v. (x)     f
    4         1          20        52
    8         9          24        36
   12        21          28        19
   16        47          32         3

Taking our assumed mean at $x=20$ and putting $\xi = x-20$,

    ξ         f           ξ         f
  -16         1           0        52
  -12         9           4        36
   -8        21           8        19
   -4        47          12         3

$$\frac{\sum f\xi}{\sum f} = \frac{-16-108-168-188+0+144+152+36}{1+9+21+47+52+36+19+3} = -\frac{148}{188} = -0.79.$$

Hence

$$\bar{x} = 20-0.79 = 19.21.$$

The solution could be still easier if we had applied the
Step Deviation Method. Putting $u = \frac{x_i-20}{4}$,

    u         f           u         f
   -4         1           0        52
   -3         9           1        36
   -2        21           2        19
   -1        47           3         3

$$\frac{\sum fu_i}{\sum f_i} = \frac{-4-27-42-47+0+36+38+9}{1+9+21+47+52+36+19+3} = -\frac{37}{188}.$$

$$\bar{x} = a+h\,\frac{\sum fu_i}{\sum f_i} = 20+4\times\Big(-\frac{37}{188}\Big) = 19.21.$$



5. Determine the quartiles and the median for the following
table :

Income                              No. of persons
Below Rs. 30                              69
Rs. 30 and below Rs. 40                  167
Rs. 40 and below Rs. 50                  207
Rs. 50 and below Rs. 60                   65
Rs. 60 and below Rs. 70                   58
Rs. 70 and below Rs. 80                   27
Rs. 80 and over                           10
Total                                    603

(Bombay B. Com. 1942)




The total frequency is 603 and the $\frac{N+1}{2}$th item, i.e. the 302nd
item, falls in the group ‘between Rs. 40 and Rs. 50’. Now
applying the formula of equation (7),

$$M_d = L+\frac{\frac{N+1}{2}-F}{f}\times i,$$

$L=40$, $f=207$, $F=69+167=236$, $i=10$, $N=603$.

$$M_d = 40+\frac{302-236}{207}\times 10 = 40+\frac{660}{207} = \text{Rs. } 43.2 \text{ nearly.}$$

The median income is Rs. 43.2 nearly.

To find $Q_1$, we see that the $\frac{N+1}{4}$th, i.e. the 151st person, comes
in the group ‘Rs. 30 and Rs. 40’.

$$Q_1 = L+\frac{\frac{N+1}{4}-F}{f}\times i.$$

Here $L=30$, $\frac{N+1}{4}=151$, $F=69$, $f=167$, $i=10$.

$$\therefore\ Q_1 = 30+\frac{151-69}{167}\times 10 = 30+\frac{820}{167} = \text{Rs. } 34.9 \text{ approx.}$$

For the third quartile $Q_3$, we have $\frac{3(N+1)}{4}=453$. Here we see
that the 453rd person falls in the group Rs. 50 to Rs. 60.

$$Q_3 = L+\frac{\frac{3(N+1)}{4}-F}{f}\times i = 50+\frac{453-443}{65}\times 10 = \text{Rs. } 51.5 \text{ approx.}$$

The upper quartile is Rs. 51.5 and the lower quartile Rs. 34.9.





6. Find the mode and the median for the following
distribution :

Variable    0-5   5-10   10-15   15-20   20-25   25-30   30-35   35-40
Frequency    2     5       7      13      21      16       8       3

(U. P. P. C. S. 1958)

The highest frequency is in the fifth class, i.e. 20-25. Referring
to equation (8), we have

$$\text{Mode} = L+\frac{f-f_{-1}}{2f-f_{-1}-f_1}\times i,$$

$L=20$, $f=21$, $f_{-1}=13$, $f_1=16$, $i=5$.

$$\text{Mode} = 20+\frac{21-13}{42-13-16}\times 5 = 20+\frac{40}{13} = 23.08 \text{ nearly.}$$

$$M_d = L+\frac{\frac{N+1}{2}-F}{f}\times i.$$

Here $N=75$ and hence the middle item, i.e. the $\frac{75+1}{2}$th or 38th
item, is in the fifth class, i.e. 20-25.

$L=20$, $\frac{N+1}{2}=38$, $F=27$, $f=21$, $i=5$.

$$M_d = 20+\frac{38-27}{21}\times 5 = 20+\frac{55}{21} = 22.62 \text{ approx.}$$

7. The monthly incomes of 10 families in a certain locality
are given below :

Family        A    B    C    D    E     F    G    H     I    J    Total
Income (Rs.)  85   70   15   75   500   20   45   250   40   36   1136

Calculate the arithmetic average, the geometric mean and the
harmonic mean of the above incomes. Which one of the above three
averages represents the above figures the best ? Give reasons.

(Agra M. A. 1955)

The number of observations is 10.

$$\text{A. M.} = \frac{1136}{10} = \text{Rs. } 113.6.$$

$$\log G = \tfrac{1}{10}\,[\log 85+\log 70+\log 15+\log 75+\log 500+\log 20+\log 45+\log 250+\log 40+\log 36]$$
$$= \tfrac{1}{10}\,[1.92942+1.84510+1.17609+1.87506+2.69897+1.30103+1.65321+2.39794+1.60206+1.55630]$$
$$= 1.803518.$$

$\therefore\ G$ = Rs. 63.61.

$$\frac{1}{H} = \frac{\frac{1}{85}+\frac{1}{70}+\frac{1}{15}+\frac{1}{75}+\frac{1}{500}+\frac{1}{20}+\frac{1}{45}+\frac{1}{250}+\frac{1}{40}+\frac{1}{36}}{10}$$
$$= \frac{0.011765+0.014286+0.066667+0.013333+0.002+0.05+0.022222+0.004+0.025+0.027778}{10}$$
$$= 0.0237051.$$

$\therefore\ H$ = Rs. 42.19 approx.

We see that the Geometric Mean represents the given
figures the best since there are eight out of ten families with
incomes less than the A. M. while the harmonic mean represents
the lower income groups only.
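
The three averages of example 7 can be recomputed with the Python standard
library (`statistics.geometric_mean` needs Python 3.8 or later). This is
only a cross-check of the arithmetic, not part of the original solution.

```python
from statistics import mean, geometric_mean, harmonic_mean

incomes = [85, 70, 15, 75, 500, 20, 45, 250, 40, 36]  # Rs., families A to J

am = mean(incomes)
gm = geometric_mean(incomes)
hm = harmonic_mean(incomes)

# Number of families whose income lies below the arithmetic mean.
below_am = sum(1 for v in incomes if v < am)

print(round(am, 1), round(gm, 2), round(hm, 2), below_am)
```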

8. Show that in finding the arithmetic mean of a set of 
readings on a thermometer , it does not matter whether we measure 
the temperature in Centigrade or Fahrenheit degrees, but that in 
finding the geometric mean it does matter. 

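Let a set of $N$ thermometric readings in Centigrade
degrees be $C_1, C_2, C_3, \ldots, C_N$ and the corresponding readings in
Fahrenheit degrees be $F_1, F_2, F_3, \ldots, F_N$.

In fact, we have

$$F_r = 32+\tfrac{9}{5}C_r, \qquad r=1, 2, 3, \ldots, N.$$

Now the arithmetic mean of the $N$ readings in Centigrade
degrees

$$= M_C = \frac{1}{N}(C_1+C_2+\cdots+C_N)$$

and the same in F degrees

$$M_F = \frac{1}{N}\sum F_r = \frac{1}{N}\Big(32N+\tfrac{9}{5}\sum C_r\Big) = 32+\tfrac{9}{5}M_C$$

$= M_C$ in C degrees, i.e. the same temperature.

But Geometric Mean in C degrees

$$G_C = (C_1C_2C_3\cdots C_N)^{1/N}$$

and G. M. in F degrees

$$G_F = (F_1F_2F_3\cdots F_N)^{1/N}$$
$$= \big[(32+\tfrac{9}{5}C_1)(32+\tfrac{9}{5}C_2)\cdots(32+\tfrac{9}{5}C_N)\big]^{1/N}$$
$$= 32\,\big[(1+\tfrac{9}{160}C_1)(1+\tfrac{9}{160}C_2)\cdots(1+\tfrac{9}{160}C_N)\big]^{1/N}$$
$$= 32\,\big(1+\tfrac{9}{160N}C_1+\cdots\big)\big(1+\tfrac{9}{160N}C_2+\cdots\big)\cdots\big(1+\tfrac{9}{160N}C_N+\cdots\big)$$
$$= 32+\tfrac{9}{5N}(C_1+C_2+\cdots+C_N)+\cdots$$

$\neq G_C$ in C degrees, i.e. it is not the same temperature as $G_C$.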
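
A quick numerical illustration of example 8 follows. The three Centigrade
readings are invented, and `to_fahrenheit` is a helper written for this
sketch.

```python
from statistics import mean, geometric_mean

def to_fahrenheit(c):
    """Convert a Centigrade reading to Fahrenheit degrees."""
    return 32 + 9 * c / 5

celsius = [10.0, 20.0, 40.0]                 # hypothetical readings
fahrenheit = [to_fahrenheit(c) for c in celsius]

# The arithmetic mean converts like any single reading ...
print(mean(fahrenheit), to_fahrenheit(mean(celsius)))
# ... but the geometric mean does not.
print(geometric_mean(fahrenheit), to_fahrenheit(geometric_mean(celsius)))
```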

9. (a) Explain the quickest method of finding the average price 
per seer of wheat for 20 years when the given data give the number 
of seers per rupee for 10 years and the price in rupee per seer for 
next ten years. 

(b) A man motors from A to B. A large part of the distance 
is up hill and he gets a mileage of only 10 miles per gallon of 
gasoline. On the return trip he makes 15 miles per gallon. Find 
the harmonic mean of his mileage. Verify the fact that this is the 
proper average to use by assuming that the distance from A to B is 
60 miles. (U. P. P. C. S. ’58) 


(a) Calculate the harmonic mean of the data for the first
ten years; its reciprocal is the average price in rupees per seer
for those years. Calculate the arithmetic mean of the prices for
the next ten years. The arithmetic mean of these two average
prices gives the average price for the 20 years.

(b) Harmonic mean of the mileage

$$= \frac{2}{\frac{1}{10}+\frac{1}{15}} = 12 \text{ miles per gallon.}$$

When the distance from A to B is 60 miles :

In the onward journey he consumes 6 gallons and in the return
journey the consumption is 4 gallons.

The total consumption of petrol = 10 gallons.

Distance covered = 120 miles.

Average mileage = 120/10

= 12 miles per gallon.

Hence the harmonic mean is the proper average in this case.


10. A motor when travelling from rest travels the first
twentieth of a mile at 6 miles per hour and the next three twentieths
at respectively 8, 12, 24 miles per hour, but its average speed over
the first fifth of a mile is not 12.5 miles per hour but 9.6. Explain
the apparent paradox. (U. P. P. C. S. ’59)

$\frac{6+8+12+24}{4} = 12.5$ is the average of the speeds, but the
average speed of the motor is the harmonic mean and not the
A. M. of the speeds, since

$$\text{the average speed} = \frac{\text{total distance travelled}}{\text{total time of journey}}.$$

$$\text{Average speed} = \frac{\frac{1}{5}}{\frac{1}{120}+\frac{1}{160}+\frac{1}{240}+\frac{1}{480}}$$

= 9.6 m. p. h.
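
Examples 9 (b) and 10 both rest on the fact that, over equal distances, the
overall rate is the harmonic mean of the separate rates. The sketch below
recomputes both from "total distance over total time"; the variable names
are chosen for this illustration only.

```python
from statistics import harmonic_mean, mean

# Example 10: four stretches of one twentieth of a mile each.
speeds = [6, 8, 12, 24]                      # miles per hour
stretch = 1 / 20                             # miles
total_distance = stretch * len(speeds)
total_time = sum(stretch / v for v in speeds)
average_speed = total_distance / total_time

# Example 9 (b): 60 miles each way at 10 and 15 miles per gallon.
mileages = [10, 15]
gallons = sum(60 / m for m in mileages)
overall_mileage = 2 * 60 / gallons

print(mean(speeds), average_speed, harmonic_mean(speeds))
print(overall_mileage, harmonic_mean(mileages))
```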

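11. A variate takes values $a, ar, ar^2, \ldots, ar^{n-1}$ each with
frequency unity. Show that the A. M. is $\frac{a(1-r^n)}{n(1-r)}$, G. M. is $ar^{(n-1)/2}$
and H. M. is $\frac{an(1-r)r^{n-1}}{1-r^n}$.

$$\bar{x} = \frac{a+ar+ar^2+\cdots+ar^{n-1}}{n} = \frac{a(1-r^n)}{n(1-r)}.$$

$$\text{G. M.} = \{a\cdot ar\cdot ar^2\cdots ar^{n-1}\}^{1/n} = a\,r^{\{1+2+\cdots+(n-1)\}/n} = ar^{(n-1)/2}.$$

$$\frac{1}{\text{H. M.}} = \frac{1}{n}\Big(\frac{1}{a}+\frac{1}{ar}+\frac{1}{ar^2}+\cdots+\frac{1}{ar^{n-1}}\Big) = \frac{1-r^n}{an(1-r)r^{n-1}}.$$

$$\therefore\ \text{H. M.} = \frac{an(1-r)r^{n-1}}{1-r^n}.$$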
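
The closed forms of example 11 can be compared with a direct computation
for one particular progression. The values a = 3, r = 2, n = 5 are
arbitrary, and `math.prod` needs Python 3.8 or later.

```python
import math

def gp_means(a, r, n):
    """A. M., G. M. and H. M. of a, ar, ..., ar^(n-1), computed term by term."""
    terms = [a * r ** k for k in range(n)]
    am = sum(terms) / n
    gm = math.prod(terms) ** (1 / n)
    hm = n / sum(1 / t for t in terms)
    return am, gm, hm

a, r, n = 3, 2, 5
am, gm, hm = gp_means(a, r, n)

# Closed forms stated in the example.
am_formula = a * (1 - r ** n) / (n * (1 - r))
gm_formula = a * r ** ((n - 1) / 2)
hm_formula = a * n * (1 - r) * r ** (n - 1) / (1 - r ** n)

print(am, am_formula)
print(gm, gm_formula)
print(hm, hm_formula)
```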


12. The distribution $x_1, x_2, \ldots, x_n$ with frequencies
$f_1, f_2, \ldots, f_n$ is transformed into the distribution $X_1, X_2, \ldots, X_n$ with
the same corresponding frequencies by the relation $X_r = ax_r+b$
where $a$ and $b$ are constants. Show that the mean, median and
mode are given in terms of those of the first distribution by the same
transformation. (B. Sc. Agra ’61)





Since $X_i = ax_i+b$, $i=1, 2, 3, \ldots, n$,

$$\bar{X} = \frac{\sum f_iX_i}{\sum f_i} = \frac{a\sum f_ix_i}{\sum f_i}+\frac{b\sum f_i}{\sum f_i} = a\bar{x}+b,$$

since $\sum f_ib = bN$ and $\sum f_i = N$.

Hence the arithmetic mean is obtained by the same transformation.

As regards the median, we are only to find the value of the
variable corresponding to the middle item and as such it is not
affected by the change of the origin or the scale. The median
item in the frequency distribution will remain the median item
in the transformed distribution and hence its value in the latter
would be represented by the same transformation,

$$X_{M_d} = ax_{M_d}+b.$$

The same argument holds for the value of the mode.
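
Example 12 can be illustrated on a short list of values. The data and the
constants a = 4, b = -1 below are arbitrary; a is taken positive so that the
order of the items is preserved.

```python
from statistics import mean, median, mode

def transform(values, a, b):
    """Apply X = a*x + b to every value."""
    return [a * v + b for v in values]

x = [2, 3, 3, 5, 7, 3, 9]        # hypothetical observations
a, b = 4, -1
X = transform(x, a, b)

print(mean(X), a * mean(x) + b)
print(median(X), a * median(x) + b)
print(mode(X), a * mode(x) + b)
```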

13. The following Table gives the marks obtained by a batch
of candidates in a certain examination in History and Politics. In
which subject is the level of knowledge of the candidates higher ?
Give reasons.

Roll No.   History   Politics      Roll No.   History   Politics
   1         42        46              9         40        30
   2         24        20             10         62        61
   3         38        41             11         55        50
   4         35        43             12         54        63
   5         30        25             13         52        45
   6         45        54             14         47        56
   7         58        47             15         43        58
   8         50        36

(M. Sc. Agra ’55)


The sum of the marks obtained by the batch of candidates
in the two subjects is equal and hence the mean marks in each
subject are the same. Arranging the marks in ascending order
of magnitude in each subject, we have

History 24, 30, 35, 38, 40, 42, 43, 45, 47, 50, 52, 54, 55, 58, 62.

Politics 20, 25, 30, 36, 41, 43, 45, 46, 47, 50, 54, 56, 58, 61, 63.

In ascending order, the marks of the $\frac{15+1}{2}$th, i.e. 8th candidate
in History are 45 and in Politics 46. These are the median marks
in the two subjects. Hence the level of knowledge in Politics is
higher than in History.

14. The rise in prices of a certain commodity was 5% in 1954, 
8% in 1955 and 77% in 1956. It is said that the average price rise 
between 1954-56 was 26% and not 30%. Justify the statement and 
show how you would explain it to a layman. (M. A. Agra ’62) 

The price at the end of 1954 was 105 in proportion to 100 at the
beginning of 1954, similarly at the end of 1955 it was 108 as against
100 at the beginning of 1955 and at the end of 1956 it was 177 against
100 at the beginning of 1956. Hence the average price at the end of a
year, taking the price at the beginning of that year to be 100, is
given by the geometric mean of 105, 108 and 177.

$$\text{G. M.} = (105\times 108\times 177)^{1/3}.$$

On taking logarithms and simplifying, we get

G. M. = 126 nearly.

Hence the average price rise was 26%.

The arithmetic mean of the rise in prices is $\frac{5+8+77}{3}\% = 30\%$,
but if the rise in the price had been 30% in each year, the price at the
end of 1956 would be $100\times\frac{130}{100}\times\frac{130}{100}\times\frac{130}{100}$, which gives a much
higher value than the value that we get from the given data,
i.e. $100\times\frac{105}{100}\times\frac{108}{100}\times\frac{177}{100}$. The latter is very nearly given by the
value $100\times\frac{126}{100}\times\frac{126}{100}\times\frac{126}{100}$ which we get by the geometric mean
calculated above.
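
The comparison made in example 14 can be reproduced in a few lines. The
sketch works with growth factors (1.05 for a 5% rise, and so on); the
variable names are chosen for this illustration, and `math.prod` needs
Python 3.8 or later.

```python
import math

rises = [5, 8, 77]                              # percentage rise in each year
factors = [1 + p / 100 for p in rises]

compound = math.prod(factors)                   # actual growth over three years
gm_factor = compound ** (1 / len(factors))      # geometric mean of the factors
am_factor = 1 + sum(rises) / (100 * len(rises)) # factor for the 30% average

print(round((gm_factor - 1) * 100))             # average rise by the G. M.
print(compound, gm_factor ** 3, am_factor ** 3)
```

Three years at the geometric-mean rate reproduce the actual growth, whereas
three years at the arithmetic-mean rate overshoot it.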


EXERCISES 

1. The element tin consists of a mixture of ten isotopes which
have atomic weights ranging from 112 to 124. The proportions
in which the isotopes occur in the element are given in
the table. Calculate the mean atomic weight of the mixture.

Isotope A. W.   112   114   115   116    117   118    119   120    122   124
Percentage      1.1   0.8   0.4   15.5   9.1   22.5   9.8   28.5   5.5   6.8

(118.8 A. W.)

2. Explain the short cut method of calculating the arithmetic
average. The following data relate to sizes of shoes sold at a
store during a given week. Find the average size by the short
cut method :

Size of shoes   4.5   5   5.5   6   6.5   7    7.5   8    8.5   9    9.5   10   10.5   11
No. of pairs    1     2   4     5   15    30   60    95   82    75   44    25   15     1

(M. A. Cal. 1936)

3. The following is the frequency distribution of a random
sample of 509 employees by weekly earnings. Calculate the
average weekly earnings :

Weekly earnings    10-   12-   14-   16-   18-   20-   22-   24-
No. of employees   3     6     10    15    24    42    75    90

Weekly earnings    26-   28-   30-   32-   34-   36-   38-   40-
No. of employees   79    55    36    26    19    13    9     7

(I. A. S. ’48) (26.14 approx.)

4. Calculate the arithmetic mean, median and the quartiles from
the following distribution of 100 persons by age :

Age last birthday   15-19   20-24   25-29   30-34   35-39   40-44
Number              4       20      38      24      10      4

(A. M. 28.9; Median Age 28.49; $Q_1$ = 25.164; $Q_3$ = 32.86)

5. Show that the median of a variable is the abscissa of the 
point of intersection of its two ogives (of less than and greater 
than types). 

6. Compute the mean, median and mode for the following
frequency distribution :

x   40-49   50-   60-   70-   80-   90-   100-   110-   120-   130-   140-   150-   160-
f   1       2     3     5     17    65    69     79     37     19     7      3      2

(106.862; 108.413; 111.423)

7. The time by your watch is 10 : 31 o'clock. In checking with
two friends, you find that their watches give the time as
10 : 25 and 10 : 34. Assuming that the three watches are
equally good time pieces, what do you think is probably the
“correct time” ? (Mean 10 : 30)

8. Calculate the geometric mean and the harmonic mean of the
following monthly incomes of 20 families in rupees :

2000; 35; 400; 15; 40; 1500; 300; 6; 90; 250; 20; 12; 450; 10;
150; 8; 25; 30; 1200; 60. (G. M. Rs. 98.37; H. M. Rs. 26.07)

9. A car travels at a speed of 30 m. p. h. for the first 40 miles,
then at a speed of 35 m. p. h. for the next 40 miles, then at a
speed of 45 m. p. h. for the next 40 miles, again at a speed of
38 m. p. h. for the next 40 miles and at a speed of 35 m. p. h. for
the next 40 miles. What is the average speed of the car over its
journey ? (35.97 m. p. h.)

10. What measures do you suggest for each of the following
distributions :

(a) Incomes of workers in a factory, (b) Heights of students,
(c) Number of petals of flowers. (M. A. Punjab ’62)

[(a) mean, (b) median, (c) mode]



CHAPTER III

MEASURES OF DISPERSION AND SKEWNESS 
3.1. While averages give an idea of the central tendency 

of the given distribution it is necessary to know how the values 

of the variate are dispersed about the central value. Thus to 
say that the average income of a number of persons is Rs. 250/- 
per month does not give an idea whether a large number among 
them are earning nearly this amount or some of them may be 
earning in thousands while a number of them very small amounts 
bringing the average to Rs. 250/-. In fig. on page 45 we have
two frequency curves with the same mean but while in A the 
values are clustered near the mean, in B the values are dispersed 

widely on both sides of the mean. 

The commonly used measures of dispersion are 

(i) Range, 

(ii) Quartile deviation. 

(iii) Standard Deviation. 

(iv) Mean Deviation about (a) mean (b) median. 

3.2. Range is the difference between the greatest and
least value of the variate. It is extremely easy to calculate it and 
gives a very general idea about the distribution. Its use is very 
much limited as it does not take into account the central tendency 
or the form of the distribution. 


3.3. Quartile Deviation or Semi Inter Quartile Range is half
the length of the interval between $Q_1$ and $Q_3$,

$$Q = \frac{Q_3-Q_1}{2}. \qquad \ldots(1)$$

If the distribution is symmetrical, it is equal to the difference
between the median and $Q_1$ or $Q_3$,

$$Q = M_d-Q_1 = Q_3-M_d. \qquad \ldots(2)$$


3.4. Standard Deviation or the Root Mean Square Deviation 
from the Mean. If we take our measure of dispersion as the sum 
of the deviations of the values of the variate from the mean, some 
of them may be of opposite signs and thus the value so obtained 
may be very small even though the actual dispersion from the 




central value may be large. In order to remove this discrepancy,
either we take the mean of the absolute deviations from the
average value or square the deviations to do away with the
negative sign of the deviations. Standard Deviation, generally
denoted by $\sigma$, is the square root of the mean of the squares of the
deviations from the mean,

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}. \qquad \text{(Agra B. Sc. 1956)}$$

In case of a frequency distribution,

$$\sigma = \sqrt{\frac{1}{N}\sum f_i(x_i-\bar{x})^2}, \qquad \ldots(3)$$

where $N$ is the total frequency.

If the deviations are measured from some other assumed
mean, say $A$, we have

$$\sum f_i(x_i-A)^2 = \sum f_i\{(x_i-\bar{x})+(\bar{x}-A)\}^2$$
$$= \sum f_i(x_i-\bar{x})^2+2(\bar{x}-A)\sum f_i(x_i-\bar{x})+(\bar{x}-A)^2\sum f_i,$$

as $\bar{x}$ and $A$ are constant quantities. Now since the sum of the
deviations of the variates from $\bar{x}$ is zero [see (5) of § 2.5], we have

$$\sum f_i(x_i-A)^2 = \sum f_i(x_i-\bar{x})^2+(\bar{x}-A)^2\sum f_i$$

or

$$\frac{1}{N}\sum f_i(x_i-A)^2 = \frac{1}{N}\sum f_i(x_i-\bar{x})^2+(\bar{x}-A)^2$$

or

$$\frac{1}{N}\sum f_i(x_i-A)^2 = \sigma^2+d^2, \qquad \ldots(4)$$

putting $\bar{x}-A = d$.

Thus we see that the mean square deviation is least when
measured from the mean, since the value of the L. H. S. shall
be least when

$$\bar{x}-A = d = 0$$

or the deviations are measured from the mean so that $A=\bar{x}$.

(Agra B. Sc. 1959)

Now $\bar{x} = A+\frac{1}{N}\sum f_i(x_i-A)$

or $d = \frac{1}{N}\sum f_i(x_i-A)$.

Hence substituting $\frac{x_i-A}{h} = u_i$ in equation (4), we have

$$\frac{1}{N}\sum f_ih^2u_i^2 = \sigma^2+\Big\{\frac{1}{N}\sum f_i(x_i-A)\Big\}^2 = \sigma^2+h^2\Big\{\frac{1}{N}\sum f_iu_i\Big\}^2$$

or

$$\sigma^2 = h^2\Big[\frac{1}{N}\sum f_iu_i^2-\Big(\frac{1}{N}\sum f_iu_i\Big)^2\Big]. \qquad \ldots(5)$$

This gives the step deviation formula for computation 
of S. D. 
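
The step deviation formula for the S. D. can be compared with the direct
definition on a small grouped table. The mid-values, frequencies and
function names below are hypothetical and serve only as an illustration.

```python
import math

def sd_direct(x, f):
    """Root mean square deviation from the mean, weighted by frequency."""
    n = sum(f)
    m = sum(fi * xi for xi, fi in zip(x, f)) / n
    return math.sqrt(sum(fi * (xi - m) ** 2 for xi, fi in zip(x, f)) / n)

def sd_step(x, f, a, h):
    """Step-deviation form: h * sqrt(mean(u^2) - mean(u)^2), u = (x - a) / h."""
    n = sum(f)
    u = [(xi - a) / h for xi in x]
    mean_u = sum(fi * ui for ui, fi in zip(u, f)) / n
    mean_u2 = sum(fi * ui * ui for ui, fi in zip(u, f)) / n
    return h * math.sqrt(mean_u2 - mean_u ** 2)

# Hypothetical mid-values of classes of width 10 and their frequencies.
mid_values = [5, 15, 25, 35, 45]
frequencies = [3, 7, 12, 6, 2]

print(sd_direct(mid_values, frequencies))
print(sd_step(mid_values, frequencies, a=25, h=10))
```

The two results should agree whatever assumed mean a is chosen, since a
enters both terms of the bracket and cancels.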

3.5. Mean Deviation or the Average Deviation. Another
measure of dispersion in which the absolute values of the
deviations are considered is used and is known as Mean Deviation.
The mean deviation about the mean

$$= \frac{1}{N}\sum f_i\,|x_i-\bar{x}|.$$

Like S. D. it considers only the positive values of the deviations,
but the S. D. is more convenient to use mathematically than
the mean deviation.

3.6. Show that in a discrete distribution, the mean deviation is least when measured from the median.

(Punjab B. A. 1958, Agra M. Sc. 1953, B. Sc. 1958)

Let the given values of the variable be arranged in ascending order of magnitude, so that

\[ x_1 \le x_2 \le x_3 \le \dots \le x_n. \]

(i) Consider the case when n is even, say 2m. Now if we consider some assumed value of the variable, say A, we see that

\[ |x_{2m}-A| + |x_1-A| \]

is minimum when A lies between $x_1$ and $x_{2m}$. Similarly

\[ |x_{2m-1}-A| + |x_2-A| \]

is minimum when A lies between $x_2$ and $x_{2m-1}$. Hence, considering the sums of the absolute deviations from values equidistant from the ends, i.e. $|x_{2m-r}-A| + |x_{r+1}-A|$, $r = 0, 1, 2, \dots, m-1$, we see that every such sum is least when A lies between $x_m$ and $x_{m+1}$, this being the condition for $|x_m-A| + |x_{m+1}-A|$ to be least. Hence the mean deviation is least when A lies between the two middle values of the variable.

(ii) Let n be odd, say 2m+1. Extending the argument as in case (i), we see that

\[ |x_{2m+1}-A| + |x_1-A| \]

is least when A lies between $x_1$ and $x_{2m+1}$. Similarly A should lie between $x_2$ and $x_{2m}$, $x_3$ and $x_{2m-1}$, ..., $x_m$ and $x_{m+2}$ for the least value of the sum of the absolute deviations from these pairs of values. Finally, the deviation from $x_{m+1}$ is zero when $A=x_{m+1}$, which is the median value of the variable.

Hence in both cases we see that the mean deviation is least when measured from the median.
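
The result of § 3.6 can be checked numerically. The sketch below is only an illustration under assumed data: the five values are hypothetical, and the search over trial origins A on a grid of step 0.1 is a brute-force device, not a method used in the text.

```python
from statistics import mean, median

def mean_abs_dev(values, a):
    """Mean of the absolute deviations of the values from the origin a."""
    return sum(abs(x - a) for x in values) / len(values)

# Hypothetical values of the variable (n odd), for illustration only.
values = [2, 3, 7, 8, 20]

# Trial origins A = 0.0, 0.1, ..., 20.0.
candidates = [k / 10 for k in range(0, 201)]
best = min(candidates, key=lambda a: mean_abs_dev(values, a))

print("median:", median(values), "best origin found:", best)
print("M.D. about median:", mean_abs_dev(values, median(values)))
print("M.D. about mean:  ", mean_abs_dev(values, mean(values)))
```

For these figures the grid search settles on the median, and the mean deviation about the median is smaller than that about the mean.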





3.7. Show that the mean deviation about the mean is less than the standard deviation. (M. Sc. Agra ’49, B. A. Hons. Delhi ’49)

We have to show that

\[ \sqrt{\frac{1}{N}\sum f_i (x_i-\bar{x})^2} > \frac{1}{N}\sum f_i\,|x_i-\bar{x}| \]

or

\[ N \sum f_i (x_i-\bar{x})^2 > \Big\{\sum f_i\,|x_i-\bar{x}|\Big\}^2 \]

or

\[ (f_1+f_2+\dots+f_n)(f_1X_1^2+\dots+f_nX_n^2) > \{f_1|X_1|+f_2|X_2|+\dots+f_n|X_n|\}^2, \]

where $x_i-\bar{x}=X_i$.

Simplifying and bringing all the terms to one side, we get

\[ f_1f_2\{X_1^2+X_2^2-2|X_1||X_2|\} + f_1f_3\{X_1^2+X_3^2-2|X_1||X_3|\} + \dots > 0, \]

which can be written as

\[ \sum_{i<j} f_if_j\{X_i^2+X_j^2-2|X_i||X_j|\} > 0 \]

or

\[ \sum_{i<j} f_if_j\,(|X_i|-|X_j|)^2 > 0, \]

which is always true, since every term is a square multiplied by positive frequencies (the sum vanishes only when all the $|X_i|$ are equal). Hence the statement.

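3.8. Solved Examples.

1. Show that in a discrete series, if the deviations $x_i$ from the mean M are so small that the third and higher powers of $x_i/M$ can be neglected, the following relations are found to hold approximately:

(i) $G = M\Big(1-\dfrac{\sigma^2}{2M^2}\Big)$. (M. Sc. Agra ’61, B. Sc. Agra ’60)

(ii) $M^2-G^2=\sigma^2$.

(iii) $H = M\Big(1-\dfrac{\sigma^2}{M^2}\Big)$.

(iv) $M+H=2G$. (M. Sc. Agra ’63)

(v) Mean $\sqrt{x} = \sqrt{M}\Big(1-\dfrac{\sigma^2}{8M^2}\Big)$.

Here G and H are the geometric and harmonic means, and the values of the variate are $x_i+M$.

(i) We have

\[ \log G = \frac{1}{N}\sum f_i \log (x_i+M) \]
\[ = \frac{1}{N}\Big[\sum f_i \log M + \sum f_i \log\Big(1+\frac{x_i}{M}\Big)\Big] \]
\[ = \frac{1}{N}\Big[\log M \sum f_i + \sum f_i\Big(\frac{x_i}{M}-\frac{x_i^2}{2M^2}+\dots\Big)\Big] \]
\[ = \log M + \frac{1}{NM}\sum f_i x_i - \frac{1}{2NM^2}\sum f_i x_i^2 + \dots \]
\[ = \log M - \frac{\sigma^2}{2M^2}, \]

since $\sum f_i=N$, $\sum f_ix_i=0$ and $\sum f_ix_i^2=N\sigma^2$.

\[ \therefore\quad G = M \exp\Big(-\frac{\sigma^2}{2M^2}\Big) = M\Big[1-\frac{\sigma^2}{2M^2}+\dots\Big]. \]

(ii) From the above relation, we have

\[ G^2 = M^2\Big(1-\frac{\sigma^2}{2M^2}\Big)^2 = M^2\Big(1-\frac{\sigma^2}{M^2}+\dots\Big), \]

giving

\[ M^2-G^2=\sigma^2. \]

(iii)

\[ \frac{1}{H} = \frac{1}{N}\sum \frac{f_i}{x_i+M} = \frac{1}{NM}\sum f_i\Big(1+\frac{x_i}{M}\Big)^{-1} \]
\[ = \frac{1}{NM}\sum f_i\Big(1-\frac{x_i}{M}+\frac{x_i^2}{M^2}-\dots\Big) \]
\[ = \frac{1}{M} - \frac{1}{NM^2}\sum f_ix_i + \frac{1}{NM^3}\sum f_ix_i^2 \]
\[ = \frac{1}{M}+\frac{\sigma^2}{M^3} = \frac{1}{M}\Big(1+\frac{\sigma^2}{M^2}\Big) \]

or

\[ H = M\Big(1+\frac{\sigma^2}{M^2}\Big)^{-1} = M\Big(1-\frac{\sigma^2}{M^2}\Big) \text{ approximately.} \]

(iv) From (i) and (iii), we get

\[ M-G=\frac{\sigma^2}{2M} \quad\text{and}\quad M-H=\frac{\sigma^2}{M}; \]

dividing, we get

\[ \frac{M-G}{M-H}=\frac{1}{2}, \]

which gives

\[ M+H=2G. \]

(v)

\[ \text{Mean } \sqrt{x} = \frac{1}{N}\sum f_i\sqrt{x_i+M} = \frac{\sqrt{M}}{N}\sum f_i\Big(1+\frac{x_i}{M}\Big)^{1/2} \]
\[ = \frac{\sqrt{M}}{N}\sum f_i\Big(1+\frac{1}{2}\frac{x_i}{M}-\frac{1}{8}\frac{x_i^2}{M^2}+\dots\Big) \]
\[ = \frac{\sqrt{M}}{N}\sum f_i + \frac{1}{2N\sqrt{M}}\sum f_ix_i - \frac{1}{8NM^{3/2}}\sum f_ix_i^2+\dots \]
\[ = \sqrt{M}-\frac{\sigma^2}{8M^{3/2}} = \sqrt{M}\Big(1-\frac{\sigma^2}{8M^2}\Big). \]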
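
The approximate relations of Example 1 can be examined numerically. The sketch below assumes a hypothetical series with mean M = 100 and small deviations -2, -1, 0, 1, 2 (each with unit frequency); these figures are not from the text and serve only to show how closely the relations hold when $x_i/M$ is small.

```python
from math import exp, log, sqrt

M = 100.0
devs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical small deviations, summing to zero
n = len(devs)
values = [M + x for x in devs]

var = sum(x * x for x in devs) / n                  # sigma^2
G = exp(sum(log(v) for v in values) / n)            # geometric mean
H = n / sum(1 / v for v in values)                  # harmonic mean
root_mean = sum(sqrt(v) for v in values) / n        # mean of sqrt(x)

print("G         :", G, "approx", M * (1 - var / (2 * M ** 2)))
print("M^2 - G^2 :", M ** 2 - G ** 2, "approx", var)
print("H         :", H, "approx", M * (1 - var / M ** 2))
print("M + H     :", M + H, "approx", 2 * G)
print("mean sqrt :", root_mean, "approx", sqrt(M) * (1 - var / (8 * M ** 2)))
```

The agreement is only approximate; it worsens as the deviations become comparable with M.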

2. Calculate the standard deviation of the following two series. Which shows greater deviation?

Series A
192 288 236 229 184 260 348 291 330 243

Series B
83 87 93 109 124 126 126 101 102 108

(P. C. S. ’38)

              Series A                              Series B
    x      xi = x - 260    xi^2          y      eta = y - 105    eta^2
  (size)   (deviation from              (size)   (deviation from
            assumed mean)                         assumed mean)
   192         -68         4624           83         -22          484
   288         +28          784           87         -18          324
   236         -24          576           93         -12          144
   229         -31          961          109           4           16
   184         -76         5776          124          19          361
   260           0            0          126          21          441
   348          88         7744          126          21          441
   291          31          961          101          -4           16
   330          70         4900          102          -3            9
   243         -17          289          108           3            9
  Sum x = 2601      Sum xi^2 = 26615    Sum y = 1059      Sum eta^2 = 2245

Series A

\[ \bar{x} = \text{A. M.} = \frac{\sum x}{n} = \frac{2601}{10} = 260.1 \]

\[ \sigma_x^2 = \frac{1}{n}\sum (x-260.1)^2 = \frac{1}{n}\sum (x-260)^2 - (260.1-260)^2 = 2661.5 - 0.01 = 2661.49 \]

\[ \sigma_x = (2661.49)^{1/2} = 51.6 \text{ approx.} \]

\[ \text{Coefficient of variation} = \frac{\sigma_x}{\bar{x}}\times 100 = \frac{51.6}{260.1}\times 100 = 19.8. \]

Series B

\[ \bar{y} = \frac{1059}{10} = 105.9 \]

\[ \sigma_y^2 = \frac{1}{n}\sum (y-105)^2 - (105.9-105)^2 = \frac{1}{10}\times 2245 - 0.81 = 223.69 \]

\[ \sigma_y = 14.96 \text{ approx.} \]

\[ \text{Coefficient of variation} = \frac{\sigma_y}{\bar{y}}\times 100 = \frac{14.96}{105.9}\times 100 = 14.1. \]

Since the coefficient of variation for series A is greater than for B, the A series shows greater variation than B.
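
The arithmetic of this example can be reproduced with a few lines of Python. The helper name `mean_sd_cv` is an illustrative choice, not a notation of the text; the data are the two series given above.

```python
from math import sqrt

def mean_sd_cv(values):
    """Return the mean, the standard deviation (divisor n) and the coefficient of variation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, sd, 100 * sd / mean

series_a = [192, 288, 236, 229, 184, 260, 348, 291, 330, 243]
series_b = [83, 87, 93, 109, 124, 126, 126, 101, 102, 108]

mean_a, sd_a, cv_a = mean_sd_cv(series_a)
mean_b, sd_b, cv_b = mean_sd_cv(series_b)

print(f"A: mean={mean_a:.1f}  s.d.={sd_a:.2f}  c.v.={cv_a:.1f}")
print(f"B: mean={mean_b:.1f}  s.d.={sd_b:.2f}  c.v.={cv_b:.1f}")
```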

3. The following table shows the number of workers in two factories whose weekly earnings are given against them. Determine the mean values of weekly earnings and the standard deviation in both cases:

  Range of weekly        Number of workers
  earnings in Rs.     Factory A    Factory B
      4-6                74           71
      6-8               376          379
      8-10              304          303
     10-12              110          112
     12-14               18           18
     14-16                0            1
     16-18                9            3
     18-20                9            9
     20-22                0            4

(M. A. Cal. ’3)

Factory A

  Weekly     Mid-value   No. of workers   xi_i = x_i - 9   f_i xi_i   f_i xi_i^2
  earnings      x_i           f_i
    4-6          5             74              -4            -296        1184
    6-8          7            376              -2            -752        1504
    8-10         9            304               0               0           0
   10-12        11            110               2             220         440
   12-14        13             18               4              72         288
   14-16        15              0               6               0           0
   16-18        17              9               8              72         576
   18-20        19              9              10              90         900
   20-22        21              0              12               0           0
                          N = Sum f                       Sum f xi     Sum f xi^2
                            = 900                          = -594       = 4892

\[ \text{Mean } \bar{x} = 9 + \frac{1}{N}\sum f_i\xi_i = 9 - \frac{594}{900} = \text{Rs. } 8.34. \]

\[ \text{S. D.} = \sigma_x = \Big[\frac{1}{N}\sum f_i\xi_i^2 - \Big(\frac{1}{N}\sum f_i\xi_i\Big)^2\Big]^{1/2} = \Big[\frac{4892}{900} - (0.66)^2\Big]^{1/2} = \text{Rs. } 2.24. \]


Factory B

  Weekly     Mid-value   No. of workers   eta_i = y_i - 9   f_i eta_i   f_i eta_i^2
  earnings      y_i           f_i
    4-6          5             71              -4             -284        1136
    6-8          7            379              -2             -758        1516
    8-10         9            303               0                0           0
   10-12        11            112               2              224         448
   12-14        13             18               4               72         288
   14-16        15              1               6                6          36
   16-18        17              3               8               24         192
   18-20        19              9              10               90         900
   20-22        21              4              12               48         576
                          N = Sum f                        Sum f eta    Sum f eta^2
                            = 900                           = -578       = 5092

\[ \bar{y} = 9 + \frac{1}{N}\sum f_i\eta_i = 9 + \frac{1}{900}(-578) = \text{Rs. } 8.36. \]

\[ \sigma_y = \Big[\frac{1}{900}\times 5092 - \Big(\frac{578}{900}\Big)^2\Big]^{1/2} = [5.66-(0.64)^2]^{1/2} = \text{Rs. } 2.29. \]

4. Calculate the mean deviation from the following data. What light does it throw on the social conditions of the community?

Difference in age between husband and wife in a particular community.

  Difference in years   Frequency      Difference in years   Frequency
        0-5                449               20-25              109
        5-10               705               25-30               52
       10-15               507               30-35               16
       15-20               281               35-40                4

(Bombay B. Com. 1930)

  Difference   Mid-value   Frequency   u_i =              f_i u_i   |x_i - 10.5|   f_i |x_i - 10.5|
  in years        x_i         f_i      (x_i - 12.5)/5
     0-5          2.5         449          -2               -898         8              3592
     5-10         7.5         705          -1               -705         3              2115
    10-15        12.5         507           0                  0         2              1014
    15-20        17.5         281           1                281         7              1967
    20-25        22.5         109           2                218        12              1308
    25-30        27.5          52           3                156        17               884
    30-35        32.5          16           4                 64        22               352
    35-40        37.5           4           5                 20        27               108
                          N = 2123                 Sum f u = -864            Sum f |x - 10.5| = 11340

\[ \bar{x} = 12.5 + \frac{5}{N}\sum f_iu_i = 12.5 + \frac{5}{2123}(-864) = 12.5 - 2.03 = 10.47 = 10.5 \text{ nearly.} \]

\[ \text{Mean Deviation} = \frac{1}{N}\sum f_i\,|x_i-\bar{x}| = \frac{1}{2123}\sum f_i\,|x_i-10.5| = \frac{1}{2123}\times 11340 = 5.3 \text{ nearly.} \]

5. For a frequency distribution of marks in History of 200 candidates (grouped in intervals 0-5, 5-10, ... etc.), the mean and standard deviation were found to be 40 and 15. Later it was discovered that the score 43 was misread as 53 in obtaining the frequency distribution. Find the corrected mean and standard deviation corresponding to the corrected frequency distribution.

(I. A. S. 1957)

For the uncorrected distribution, the mean is 40 and the number of students, i.e. the total frequency, is 200.

\[ 40 = \frac{1}{200}\sum f_ix_i. \]

\[ \therefore\quad \sum f_ix_i = 8000, \]

where $x_i$ is the mid-value of the class interval and $f_i$ the frequency of the ith class.

Since in the uncorrected state 43 was misread as 53, the score 53 must have come in the class 50-55 with mid-value 52.5, while the corrected mid-value should have been 42.5.

Hence

\[ \sum f_ix_i \text{ (corrected)} = 8000 - 52.5 + 42.5 = 7990. \]

\[ \therefore\quad \text{Mean (corrected)} = \frac{7990}{200} = 39.95 \text{ marks.} \]

Similarly

\[ \sum f_ix_i^2 \text{ (uncorrected)} - \sum f_ix_i^2 \text{ (corrected)} = (52.5)^2 - (42.5)^2. \]

Now

\[ \sum f_ix_i^2 \text{ (uncorrected)} = N(\sigma^2+M^2) = 200\,\{(15)^2+(40)^2\} = 365000. \]

\[ \therefore\quad \sum f_ix_i^2 \text{ (corrected)} = 365000 - (52.5)^2 + (42.5)^2 = 365000 - (95\times 10) = 364050. \]

\[ \sigma^2 \text{ (corrected)} = \frac{1}{N}\sum f_ix_i^2 \text{ (corrected)} - M^2 \text{ (corrected)} = \frac{364050}{200} - (39.95)^2. \]

\[ \therefore\quad \sigma = 14.97 \text{ approx.} \]

The corrected mean is 39.95 and the corrected standard deviation 14.97.
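
The correction can be retraced in Python as a check on the arithmetic. The variable names are illustrative; the figures are those of the example (N = 200, mean 40, S. D. 15, and the mid-value 52.5 to be replaced by 42.5).

```python
from math import sqrt

n, mean, sd = 200, 40.0, 15.0
wrong_mid, right_mid = 52.5, 42.5      # mid-values of the classes 50-55 and 40-45

sum_x = n * mean                        # uncorrected sum of f*x
sum_x2 = n * (sd ** 2 + mean ** 2)      # uncorrected sum of f*x^2

# Remove the wrongly classed item and put it in its proper class.
sum_x_c = sum_x - wrong_mid + right_mid
sum_x2_c = sum_x2 - wrong_mid ** 2 + right_mid ** 2

mean_c = sum_x_c / n
sd_c = sqrt(sum_x2_c / n - mean_c ** 2)

print(sum_x_c, sum_x2_c, mean_c, round(sd_c, 3))
```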

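3.9. Moments. The rth moment of x about any origin a is given by

\[ \mu_r' = \frac{1}{N}\sum f_i (x_i-a)^r, \]

where r is a positive integer and $N=\sum f_i$.

When the origin is taken at the mean of the distribution, it is called a central moment and is denoted by

\[ \mu_r = \frac{1}{N}\sum f_i (x_i-\bar{x})^r. \]

Now

\[ \mu_0' = \frac{1}{N}\sum f_i = 1 = \mu_0, \qquad \mu_1' = \bar{x}-a = d \text{ (say).} \]

Relation between central moments and moments about any arbitrary origin. We have

\[ \mu_r = \frac{1}{N}\sum f_i (x_i-\bar{x})^r = \frac{1}{N}\sum f_i \{(x_i-a)-(\bar{x}-a)\}^r = \frac{1}{N}\sum f_i (x_i-a-\mu_1')^r. \]

Expanding by the binomial theorem,

\[ \mu_r = \frac{1}{N}\Big[\sum f_i (x_i-a)^r - {}^rC_1\,\mu_1' \sum f_i (x_i-a)^{r-1} + \dots + (-1)^p\,{}^rC_p\,\mu_1'^{\,p} \sum f_i (x_i-a)^{r-p} + \dots + (-1)^r \mu_1'^{\,r} \sum f_i\Big] \]
\[ = \mu_r' - {}^rC_1\,\mu_1'\,\mu_{r-1}' + {}^rC_2\,\mu_1'^{\,2}\,\mu_{r-2}' - \dots + (-1)^r \mu_1'^{\,r}. \]

In particular

\[ \mu_2 = \mu_2' - \mu_1'^{\,2}, \]
\[ \mu_3 = \mu_3' - 3\mu_2'\mu_1' + 3\mu_1'^{\,3} - \mu_1'^{\,3} = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^{\,3}, \]
\[ \mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^{\,2} - 4\mu_1'^{\,4} + \mu_1'^{\,4} = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^{\,2} - 3\mu_1'^{\,4}. \]

It is well to remember that $\mu_2 = (\text{Standard Deviation})^2$ and is known as the Variance.

It can be proved that

\[ \mu_r' = \mu_r + {}^rC_1\,\mu_1'\,\mu_{r-1} + {}^rC_2\,\mu_1'^{\,2}\,\mu_{r-2} + \dots + \mu_1'^{\,r} \]

by putting

\[ \mu_r' = \frac{1}{N}\sum f_i (x_i-a)^r = \frac{1}{N}\sum f_i (x_i-\bar{x}+\mu_1')^r. \]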

The first moment $\mu_1'$ (about the origin) gives the mean and is a measure of the central tendency; the second moment $\mu_2$ about the mean is known as the variance and is a measure of dispersion. The third moment $\mu_3$ about the mean indicates the symmetry or asymmetry of the distribution, it being zero for a symmetrical distribution. The fourth moment $\mu_4$ about the mean is a measure of kurtosis, or the flatness of the frequency curve, as explained afterwards.

[Three figures: frequency curves illustrating dispersion, symmetry and kurtosis.]

The above three figures give some idea of the dispersion, symmetry and kurtosis in frequency curves. In positively skew distributions, the mean is towards the right side of the mode or median, and in negatively skew distributions it is towards the left.

The moments of orders higher than four can also be applied to determine the various characteristics of the distribution, e.g. all odd order moments of symmetrical distributions vanish; but since the calculation of these high order moments is very tedious, they are not generally used.

3.10. Solved Examples.

1. Write short notes on

(i) Sheppard's corrections. (Agra B. Sc. ’60)

(ii) Kurtosis. (Agra B. Sc. ’60, ’56, ’59)

(iii) Coefficient of variation.

(iv) Charlier's checks. (Agra B. Sc. ’60)

Sheppard's corrections. While calculating the mean and higher moments of grouped frequency distributions, the mid-value of the class interval is taken to be the value of the variate in that interval, which amounts to assuming that the frequencies are concentrated at the mid-values. This causes certain errors in the calculation of the moments. W. F. Sheppard proved that

\[ \mu_1' \text{ (corrected)} = \mu_1', \]
\[ \mu_2 \text{ (corrected)} = \mu_2 - \frac{h^2}{12}, \]
\[ \mu_3 \text{ (corrected)} = \mu_3, \]
\[ \mu_4 \text{ (corrected)} = \mu_4 - \frac{1}{2}h^2\mu_2 + \frac{7}{240}h^4, \]

where h is the width of the class interval, provided that

(i) The frequency curve of the distribution is continuous.

(ii) The frequency tapers off to zero at both ends.

(iii) The number of classes is not too large.

Kurtosis. Karl Pearson has defined four important coefficients:

\[ \beta_1 = \frac{\mu_3^2}{\mu_2^3}, \qquad \gamma_1 = +\sqrt{\beta_1}; \]
\[ \beta_2 = \frac{\mu_4}{\mu_2^2}, \qquad \gamma_2 = \beta_2 - 3 = \frac{\mu_4-3\mu_2^2}{\mu_2^2}. \]

As can be seen, these coefficients are all pure numbers and are independent of the units of the variable, since $\mu_n$ is of the order $(\text{variable})^n$. $\beta_1$ is the measure of skewness and is zero for a symmetrical distribution.

$\beta_2$ is the measure of flatness or peakedness of a single humped distribution. For a normal distribution $\beta_2=3$, and hence any distribution having $\beta_2>3$ or $\gamma_2>0$ will be peaked more sharply than the normal curve and is known as lepto-kurtic (narrow), while if $\beta_2<3$ or $\gamma_2<0$, the distribution is termed platy-kurtic (broad).

Coefficient of variation. The significance of a measure of dispersion of Rs. 40.00 in the income of persons with an average income of Rs. 100.00 is much more than that of the same dispersion with an average income of Rs. 1000.00. A relative measure of dispersion known as the coefficient of variation is given by

\[ 100\,\frac{\sigma}{M}, \]

where M is the mean and $\sigma$ the standard deviation.

Also the coefficient of dispersion for the quartile deviation is

\[ \frac{Q_3-Q_1}{Q_3+Q_1}. \]

Charlier's check. To guard against mistakes in the calculation of the mean and standard deviation, it is well to use some checks. Thus

\[ \sum f(\xi+1) = \sum f\xi + \sum f, \]
\[ \sum f(\xi+1)^2 = \sum f\xi^2 + 2\sum f\xi + \sum f. \]

Hence if we calculate $\sum f\xi^2$ and $\sum f(\xi+1)^2$ separately, the second relation gives a very good check on our calculations. The method is illustrated in the examples given at the end of this chapter.
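
A minimal sketch of Charlier's check in Python follows. It uses the Factory A figures of Example 3, § 3.8 (deviations $\xi = x - 9$ and the frequencies given there); the variable names are illustrative only. In hand computation the two sides are worked out from separate columns, so agreement guards against slips of arithmetic.

```python
# Deviations xi = x - 9 and frequencies for Factory A (Example 3, Section 3.8).
xi = [-4, -2, 0, 2, 4, 6, 8, 10, 12]
f = [74, 376, 304, 110, 18, 0, 9, 9, 0]

n = sum(f)
s1 = sum(fi * x for fi, x in zip(f, xi))            # sum f*xi
s2 = sum(fi * x * x for fi, x in zip(f, xi))        # sum f*xi^2

# Columns computed independently from xi + 1.
c1 = sum(fi * (x + 1) for fi, x in zip(f, xi))      # sum f*(xi+1)
c2 = sum(fi * (x + 1) ** 2 for fi, x in zip(f, xi)) # sum f*(xi+1)^2

print("check 1:", c1, "=", s1 + n)
print("check 2:", c2, "=", s2 + 2 * s1 + n)
```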

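2. Show that for a discrete distribution $\beta_2 > 1$.

(M. Sc. Agra ’53, ’56, B. Sc. Agra ’58, ’61)

We have to show that

\[ \frac{\mu_4}{\mu_2^2} > 1 \]

or

\[ \frac{1}{N}\sum f_iX_i^4 > \Big\{\frac{1}{N}\sum f_iX_i^2\Big\}^2, \quad\text{where } X_i = x_i-\bar{x}, \]

or

\[ \sum f_i \sum f_iX_i^4 > \Big\{\sum f_iX_i^2\Big\}^2 \]

or

\[ (f_1+f_2+\dots+f_n)(f_1X_1^4+f_2X_2^4+\dots+f_nX_n^4) > (f_1X_1^2+f_2X_2^2+\dots+f_nX_n^2)^2. \]

Simplifying and bringing all terms to one side, we have

\[ f_1f_2(X_1^4+X_2^4-2X_1^2X_2^2) + f_1f_3(X_1^4+X_3^4-2X_1^2X_3^2) + \dots > 0 \]

or

\[ \sum_{i<j} f_if_j\,(X_i^2-X_j^2)^2 > 0, \]

which is true. Hence the result.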

3. Explain what you mean by skewness. What formula would you use for measuring skewness? Show that skewness ranges from -1 to +1. (M. Sc. Agra ’52, ’57; B. Sc. Agra ’57)

Skewness is the lack of symmetry. It is termed positive if the longer tail of the frequency curve is towards the higher values of the variate or, in other words, the mean is greater than the mode or median, and negative in the reverse case. For a symmetrical distribution, the mean, mode and median coincide. There are two formulae for the coefficient of skewness.

(i) Karl Pearson's coefficient of skewness

\[ S_k = \frac{\text{Mean} - \text{Mode}}{\text{S. D.}}. \]

(ii) The coefficient of skewness based on the quartiles

\[ S_k = \frac{Q_3+Q_1-2Q_2}{Q_3-Q_1} = \frac{(Q_3-Q_2)-(Q_2-Q_1)}{(Q_3-Q_2)+(Q_2-Q_1)}, \]

where $Q_2$ is the median and $Q_1$ and $Q_3$ the two quartiles.

The first coefficient ranges from -3 to +3, but since the mode is not a well defined average, generally the second coefficient is used. It can easily be seen that since $Q_1 \le Q_2 \le Q_3$, the absolute value of the numerator is not greater than the denominator, so that this coefficient of skewness lies between -1 and +1. The coefficient of skewness is a pure number and is zero for a symmetrical distribution.

4. The deviation of a distribution is measured from a value differing from the mean of the distribution by x. Show that if x is plotted against the corresponding mean square deviation, the points lie on a parabola. (B. Sc. Agra ’61)

By equation (4) of § 3.4, the mean square deviation $s^2$ measured from a value differing from the mean by x is $s^2=\sigma^2+x^2$. Now if x is plotted against $s^2$, the resulting curve is a parabola whose vertex is at the point $x=0$, $s^2=\sigma^2$.

5. The first three moments of a distribution about the value 2 of the variable are 1, 16 and -40. Show that the mean is 3, the variance 15 and $\mu_3=-86$.

Also show that the first three moments about x = 0 are 3, 24 and 76. (Agra M. Sc. ’63, Agra B. Sc. ’63)

We have

\[ \frac{1}{N}\sum f_i (x_i-2) = \frac{1}{N}\sum f_ix_i - 2 = 1 \text{ (given).} \]

Hence

\[ \frac{1}{N}\sum f_ix_i = 3. \]

Hence the mean of the distribution is 3.

Also

\[ \frac{1}{N}\sum f_i (x_i-2)^2 = \frac{1}{N}\sum f_i (x_i-3+1)^2 \]
\[ = \frac{1}{N}\Big[\sum f_i (x_i-3)^2 + 2\sum f_i (x_i-3) + \sum f_i\Big] \]
\[ = \mu_2 + 1 = 16 \text{ (given).} \]

Hence $\mu_2$ (variance) = 15.

Again

\[ \frac{1}{N}\sum f_i (x_i-2)^3 = \frac{1}{N}\sum f_i (x_i-3+1)^3 \]
\[ = \frac{1}{N}\Big[\sum f_i (x_i-3)^3 + 3\sum f_i (x_i-3)^2 + 3\sum f_i (x_i-3) + \sum f_i\Big] \]
\[ = \mu_3 + 3\mu_2 + 1 = -40 \text{ (given).} \]

\[ \therefore\quad \mu_3 = -86. \]

The distribution is negatively skew, the longer tail of the frequency curve being towards the lower values of the variate.

The second part of the question is left as an exercise for the students.
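
For readers who wish to check their working, the change of origin used above can be sketched in Python. The function name `shift_moments` is a hypothetical helper, not a notation of the text; it applies the binomial expansion of § 3.9 to pass from moments about a to moments about a + d.

```python
from math import comb

def shift_moments(raw, d):
    """Given raw[r] = r-th moment about an origin a (with raw[0] = 1),
    return the moments about the new origin a + d, using
    E[(x - a - d)^r] = sum_p C(r, p) * (-d)^p * E[(x - a)^(r - p)]."""
    return [
        sum(comb(r, p) * (-d) ** p * raw[r - p] for p in range(r + 1))
        for r in range(len(raw))
    ]

about_2 = [1, 1, 16, -40]          # moments of order 0, 1, 2, 3 about x = 2

# The mean lies 1 unit above the origin 2, so shift by d = 1 for central moments.
central = shift_moments(about_2, 1)
# The origin x = 0 lies 2 units below the origin 2, so shift by d = -2.
about_0 = shift_moments(about_2, -2)

print("central moments:", central)
print("moments about 0:", about_0)
```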


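6. Show that if the class interval of a grouped distribution is less than one third of the calculated standard deviation, Sheppard's correction makes a difference of less than ½% in the estimate of the S. D.

(B. A. Lucknow ’48)

From Sheppard's correction, we have

\[ \mu_2 \text{ (corrected)} = \mu_2 - \frac{h^2}{12}, \]

i.e.

\[ \sigma_1^2 = \sigma^2 - \frac{h^2}{12}, \]

where $\sigma_1$ is the corrected S. D.

\[ \therefore\quad \sigma_1 = \sigma\Big(1-\frac{h^2}{12\sigma^2}\Big)^{1/2} = \sigma\Big(1-\frac{h^2}{24\sigma^2}-\dots\Big), \]

so that

\[ \sigma-\sigma_1 = \frac{h^2}{24\sigma} \text{ to the first approximation.} \]

Since $h<\dfrac{\sigma}{3}$,

\[ \frac{h^2}{24\sigma^2} < \frac{1}{216}. \]

Hence

\[ \frac{\sigma-\sigma_1}{\sigma} < \frac{1}{216} < \frac{1}{200}. \]

Hence the correction makes a difference of less than ½%.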
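
The bound can be illustrated numerically. In the sketch below the function name and the trial values of $\sigma$ and h are assumptions made for the illustration; the exact relative change $1-\sqrt{1-h^2/(12\sigma^2)}$ is used rather than the first approximation.

```python
from math import sqrt

def sheppard_relative_change(sd, h):
    """Relative difference between the uncorrected S.D. and the
    Sheppard-corrected S.D. sqrt(sd^2 - h^2/12)."""
    corrected = sqrt(sd ** 2 - h ** 2 / 12)
    return (sd - corrected) / sd

# Limiting case h = sd/3, and a narrower class interval (hypothetical values).
for sd, h in [(3.0, 1.0), (3.0, 0.5)]:
    change = sheppard_relative_change(sd, h)
    print(f"sd={sd}, h={h}: change = {100 * change:.3f} per cent")
```

Even in the limiting case h = σ/3 the change stays below 0.5 per cent, and it falls rapidly as h decreases.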
7. Given the sizes, means and standard deviations (measured from the respective means) of two samples, how would you obtain the mean and standard deviation of the sample obtained by pooling the two samples?

(B. Sc. Agra ’59, ’61; Delhi B. A. Hons. ’49, Madras ’53)

Let the sizes, means and standard deviations of the two samples be $N_r$, $m_r$ and $\sigma_r$ respectively (r = 1, 2), and let these values for the combined sample be N, m and $\sigma$.

\[ N = N_1+N_2, \]
\[ Nm = N_1m_1+N_2m_2. \]

Now, taking second moments about the origin,

\[ \mu_2' = \frac{1}{N_1+N_2}\Big\{\sum_1 f\,x^2 + \sum_2 f\,x^2\Big\} = \frac{1}{N_1+N_2}\{N_1\mu_2'^{(1)}+N_2\mu_2'^{(2)}\} \]

or

\[ \sigma^2+m^2 = \frac{1}{N_1+N_2}\{N_1(\sigma_1^2+m_1^2)+N_2(\sigma_2^2+m_2^2)\} \]

or

\[ \sigma^2 = \frac{N_1\sigma_1^2+N_2\sigma_2^2}{N_1+N_2} + \Big\{\frac{N_1m_1^2+N_2m_2^2}{N_1+N_2} - \Big(\frac{N_1m_1+N_2m_2}{N_1+N_2}\Big)^2\Big\} \]
\[ = \frac{N_1\sigma_1^2+N_2\sigma_2^2}{N_1+N_2} + \frac{N_1N_2}{(N_1+N_2)^2}\,(m_1-m_2)^2. \]    ...(1)

In general, if there are l samples, l > 2, then

\[ \sigma^2 = \frac{1}{N}\sum_{r=1}^{l} N_r\sigma_r^2 + \frac{1}{N^2}\sum_{i<j} N_iN_j\,(m_i-m_j)^2, \qquad N=\sum_{r=1}^{l} N_r. \]

If in (1) above $m_1=m_2$,

\[ \sigma^2 = \frac{N_1\sigma_1^2+N_2\sigma_2^2}{N_1+N_2}. \]

That is, the variance of the combined series is the arithmetic mean of the variances of the two series weighted by their sizes when the means of the two series coincide.

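8. Find the mean, standard deviation and skewness of the distribution

  Variable    0-5   5-10   10-15   15-20   20-25   25-30   30-35   35-40
  Frequency    2     5       7      13      21      16       8       3

(U. P. P. C. S. 1958)

  Variable   Mid-value   Frequency   u_i =              f_i u_i   f_i u_i^2
                x_i         f_i      (x_i - 17.5)/5
    0-5         2.5           2          -3                -6         18
    5-10        7.5           5          -2               -10         20
   10-15       12.5           7          -1                -7          7
   15-20       17.5          13           0                 0          0
   20-25       22.5          21           1                21         21
   25-30       27.5          16           2                32         64
   30-35       32.5           8           3                24         72
   35-40       37.5           3           4                12         48
                          N = 75              Sum f u = 66    Sum f u^2 = 250

\[ \text{Mean} = 17.5 + \frac{h}{N}\sum f_iu_i = 17.5 + \frac{1}{75}\times 5\times 66 = 21.9. \]

\[ \sigma^2 = h^2\Big[\frac{1}{N}\sum f_iu_i^2 - \Big(\frac{1}{N}\sum f_iu_i\Big)^2\Big] = \frac{25}{75}\times 250 - (4.4)^2 = 63.97, \]

\[ \sigma = 7.99 = 8.0 \text{ approx.} \]

\[ \text{Skewness} = \frac{\text{Mean}-\text{Mode}}{\text{Standard Deviation}} = \frac{21.9-23.08}{8.0} = -0.15 \text{ approx.} \]

Note. In Q. No. 6 on page 26 of the previous chapter, the mode has been found equal to 23.08.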

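9. Find the quartile and mean deviations as well as the coefficient of skewness from the following figures:

  Weights in lbs.   No. of persons      Weights in lbs.   No. of persons
      70-80              12                 110-120             50
      80-90              18                 120-130             45
      90-100             35                 130-140             20
     100-110             42                 140-150              8

(Agra B. Com. 1940, M. A. 1952)

\[ N = \sum f_i = 230, \qquad \frac{N}{2} = \frac{230}{2} = 115. \]

The median is the 115th item, which lies in the 110-120 group. With L the lower limit of that group, c the cumulative frequency below it, f its frequency and i the class interval,

\[ M_d = L + \frac{N/2-c}{f}\times i = 110 + \frac{115-107}{50}\times 10 = 111.6. \]

\[ \frac{N}{4} = \frac{230}{4} = 57.5. \]

The lower quartile $Q_1$ is the 57.5th or 58th item, which lies in the 90-100 group.

\[ Q_1 = 90 + \frac{57.5-30}{35}\times 10 = 90 + 7.86 = 97.86. \]

Similarly the upper quartile is the $\frac{3\times 230}{4}$th or 172.5th or 173rd item, in the 120-130 group.

\[ \therefore\quad Q_3 = 120 + \frac{172.5-157}{45}\times 10 = 120 + 3.44 = 123.44. \]

Also

\[ \text{Skewness} = \frac{Q_1+Q_3-2M_d}{Q_3-Q_1} = \frac{97.86+123.44-2\times 111.6}{123.44-97.86} = \frac{-1.90}{25.58} = -0.07 \text{ nearly.} \]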
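
The interpolation used for the median and the quartiles can be sketched in Python as below. The function `grouped_quantile` is a hypothetical helper written for this illustration; it assumes classes of equal width and linear interpolation within the class containing the required item.

```python
def grouped_quantile(lowers, width, freqs, p):
    """Value below which a fraction p of the total frequency lies,
    by linear interpolation inside the class containing the p*N-th item."""
    target = p * sum(freqs)
    cum = 0
    for lower, f in zip(lowers, freqs):
        if cum + f >= target:
            return lower + (target - cum) / f * width
        cum += f
    raise ValueError("p must lie between 0 and 1")

lowers = [70, 80, 90, 100, 110, 120, 130, 140]   # lower class limits (lbs.)
freqs = [12, 18, 35, 42, 50, 45, 20, 8]

q1 = grouped_quantile(lowers, 10, freqs, 0.25)
md = grouped_quantile(lowers, 10, freqs, 0.50)
q3 = grouped_quantile(lowers, 10, freqs, 0.75)
skew = (q1 + q3 - 2 * md) / (q3 - q1)

print(round(q1, 2), round(md, 2), round(q3, 2), round(skew, 2))
```

The same helper also gives the quartile deviation $(Q_3-Q_1)/2$ asked for in the question.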


10. The means of two samples of sizes 50 and 100 respectively are 54.1 and 50.3 and the standard deviations are 8 and 7. Find the mean and standard deviation of the sample of size 150 obtained by combining the two samples. (Lucknow 1943)

\[ \bar{X} = \frac{N_1\bar{X}_1+N_2\bar{X}_2}{N_1+N_2} = \frac{(50\times 54.1)+(100\times 50.3)}{150} = \frac{2705+5030}{150} = 51.57. \]

Also

\[ \sigma^2 = \frac{N_1\sigma_1^2+N_2\sigma_2^2}{N} + \frac{N_1N_2}{N^2}\,(\bar{X}_1-\bar{X}_2)^2 \]
\[ = \frac{(50\times 64)+(100\times 49)}{150} + \frac{50\times 100}{(150)^2}\,(54.1-50.3)^2 \]
\[ = \frac{3200+4900}{150} + \frac{2}{9}\,(3.8)^2 = 57.2089. \]

\[ \therefore\quad \sigma = 7.56 \text{ approx.} \]
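
Formula (1) of Example 7 lends itself to a short Python sketch; the function name `pool` is illustrative only, and the figures are those of the present example.

```python
from math import sqrt

def pool(n1, m1, s1, n2, m2, s2):
    """Size, mean and S.D. of the sample obtained by pooling two samples
    of sizes n1, n2, means m1, m2 and standard deviations s1, s2."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    var = (n1 * s1 ** 2 + n2 * s2 ** 2) / n + n1 * n2 * (m1 - m2) ** 2 / n ** 2
    return n, m, sqrt(var)

n, m, sd = pool(50, 54.1, 8, 100, 50.3, 7)
print(n, round(m, 2), round(sd, 2))
```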

11. In a certain test for which the pass marks are 30, the distribution of passing candidates classified by sex (boys and girls) was as given below:

    Marks        Frequency
               Boys    Girls
    30-34        5       15
    35-39       10       20
    40-44       15       30
    45-49       30       20
    50-54        5        5
    55-59        5        -
    Total       70       90

The overall mean and standard deviation of marks for boys, including the 30 failed, were 38 and 10. The corresponding figures for the girls, including the 10 failed, were 35 and 9.

(a) Find the mean and standard deviation of marks obtained by the 30 boys who failed in the test.

(b) The moderation committee argued that the percentage of passing marks among girls is higher because the girls are very studious, and if the intention is to pass those who are really intelligent, a higher pass mark should be used for girls. Without questioning the propriety of this argument, suggest what the pass marks should be which would allow only 70 percent of the girls to pass. (I. A. S. 1959)


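  Marks   Mid-value   f_i      u_i =          f_i u_i   f_i u_i^2   f'_i      f'_i u_i   f'_i u_i^2
             x_i     (boys)   (x_i - 42)/5                          (girls)
  30-34      32         5        -2             -10        20         15        -30         60
  35-39      37        10        -1             -10        10         20        -20         20
  40-44      42        15         0               0         0         30          0          0
  45-49      47        30         1              30        30         20         20         20
  50-54      52         5         2              10        20          5         10         20
  55-59      57         5         3              15        45          0          0          0
  Total                70                        35       125         90        -20        120

(a)

\[ \bar{X}_1 \text{ (mean for successful boys)} = 42 + \frac{h}{N}\sum f_iu_i = 42 + \frac{5}{70}\times 35 = 44.5 \text{ marks.} \]

\[ \sigma_1^2 \text{ (successful boys)} = h^2\Big[\frac{1}{N}\sum f_iu_i^2 - \Big(\frac{1}{N}\sum f_iu_i\Big)^2\Big] = \frac{25}{70}\times 125 - (2.5)^2 = 38.39. \]

Hence $\sigma_1 = 6.2$.

The overall mean for the 100 boys is

\[ \bar{X} = \frac{N_1\bar{X}_1+N_2\bar{X}_2}{N_1+N_2} \]

or

\[ 38 = \frac{70\times 44.5+30\bar{X}_2}{70+30}, \]

giving

\[ \bar{X}_2 = 22.83 \text{ approx.} \]

Now the overall standard deviation for the 100 boys is given as 10. Since $\bar{X}_1 \ne \bar{X}_2$, we have

\[ 100 = \frac{70\sigma_1^2+30\sigma_2^2}{100} + \frac{70\times 30}{(100)^2}\,(44.5-22.83)^2 \]

or

\[ 0.3\,\sigma_2^2 = -26.87 - 0.21\,(21.67)^2 + 100. \]

Since it gives a negative value of $\sigma_2^2$, it seems that there is some mistake in the given data.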


(b) We arrange the marks scored by the girls as under:

  Marks:            55-59   50-54   45-49   40-44   35-39   30-34   below 30
  Number of girls:    0       5      20      30      20      15       10

The total number of girls appearing is 100, and we want the lower limit of the group in which the upper 70% of the girls fall or, starting from the higher marks, the marks which 7/10 of the total number of girls scored. Hence our fractile in this case is 7/10, and the formula to be applied is equation (13) of the previous chapter.

Here the 70th girl falls in the 35-39 group.

\[ L_{.7} = 35 + \frac{70-55}{20}\times 5 = 35 + 3.75 = 38.75. \]

Hence the minimum pass marks for the girls should be 39 to make the result 70% among them.

(c) The prize committee decided to award prizes to the best 40 candidates (irrespective of sex) judged on the basis of marks obtained in the test. Estimate the number of girls who would receive the prizes.

  Marks       Candidates (Boys and Girls)   Candidates (Girls)
  55-59                  5                          0
  50-54                 10                          5
  45-49                 50                         20
  40-44                 45                         30
  35-39                 30                         20
  30-34                 20                         15
  Below 30              40                         10
  Total                200                        100

In order to find the lower limit of the marks scored by the first 40 candidates, i.e. 1/5 of the total, we find $L_{.2}$. We see that the 40th candidate comes in the 45-49 group.

\[ L_{.2} = 45 + \frac{40-15}{50}\times 5 = 45 + 2.5 = 47.5 \text{ marks.} \]

Now we find m, the number of girls who scored more than 47.5 marks. For this, we must have

\[ 47.5 = 45 + \frac{m-5}{20}\times 5, \]

since the sum of the frequencies in the first two classes (in the girls' column above) is 5, and 20 is the frequency of the 45-49 marks class.

\[ \therefore\quad m = 15, \]

so that the number of girls receiving the prize is 15.

12. Define mean deviation and standard deviation. Calculate the mean deviation from the mean and the standard deviation of the series a, a+d, a+2d, ..., a+2nd.

Prove further that the latter is greater than the former.

(B. Sc. Agra ’63)

The number of items in the series is 2n+1.

\[ \bar{x} \text{ (mean)} = \frac{1}{2n+1}\{a+(a+d)+(a+2d)+\dots+(a+2nd)\} = \frac{1}{2n+1}\cdot(2n+1)(a+nd) = a+nd. \]

Mean deviation about the mean

\[ = \frac{1}{2n+1}\sum_{r=0}^{2n} |(a+rd)-(a+nd)| = \frac{1}{2n+1}\cdot 2d\,\{1+2+\dots+n\} = \frac{n(n+1)\,d}{2n+1}. \]

\[ \sigma^2 = \frac{1}{2n+1}\sum_{r=0}^{2n} \{(a+rd)-(a+nd)\}^2 = \frac{1}{2n+1}\cdot 2d^2\{n^2+(n-1)^2+(n-2)^2+\dots+1^2\} \]
\[ = \frac{2d^2}{2n+1}\cdot\frac{n(n+1)(2n+1)}{6} = \frac{n(n+1)}{3}\,d^2. \]

We have further to show that (taking d > 0)

\[ \sqrt{\frac{n(n+1)}{3}}\;d > \frac{n(n+1)\,d}{2n+1} \]

or

\[ (2n+1)^2 > 3n(n+1), \]

i.e.

\[ n^2+n+1 > 0, \]

which is true since n > 0.
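
The two closed forms can be compared with a direct computation. The sketch below takes hypothetical values a = 3, d = 2, n = 4 (not from the text) and the function name `md_and_sd` is illustrative.

```python
from math import sqrt

def md_and_sd(a, d, n):
    """Mean deviation about the mean and standard deviation of the
    series a, a+d, ..., a+2nd, computed directly from the terms."""
    terms = [a + r * d for r in range(2 * n + 1)]
    mean = sum(terms) / len(terms)
    md = sum(abs(t - mean) for t in terms) / len(terms)
    sd = sqrt(sum((t - mean) ** 2 for t in terms) / len(terms))
    return md, sd

a, d, n = 3, 2, 4                       # hypothetical values
md, sd = md_and_sd(a, d, n)

print(md, n * (n + 1) * d / (2 * n + 1))      # mean deviation vs closed form
print(sd, d * sqrt(n * (n + 1) / 3))          # S.D. vs closed form
```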




EXERCISES

1. Find the standard deviation of the following data:

  x    1   2   3   4   5   6
  f    2   6  12   7   2   1

[Ans. 1.117]

2. Calculate the standard deviation and semi-quartile range for the following table giving the age distribution of 542 members of the House of Commons:

  Age (in years)    20-30   30-   40-   50-   60-   70-   80-
  No. of members      3      61   132   153   140    51     2

[Ans. s. d. = 11.9 years, semi-quartile range = 9.351 years]

3. Calculate the quartile deviation, mean deviation and standard deviation from the following data:

  Class intervals     f
     195-199          1
     190-194          2
     185-189          4
     180-184          5
     175-179          8
     170-174         10
     165-169          6
     160-164          4
     155-159          4
     150-154          2
     145-149          3
     140-144          1

[Ans. Q = 8.28, M. D. = 5.32, s. d. = 6.68]

4. The goals scored by two teams A and B in a foot-ball season were as follows:

  No. of goals scored      No. of matches
  in a match                 A        B
        0                   27       17
        1                    9        9
        2                    8        6
        3                    5        5
        4                    4        3

Find which team is more consistent. [I. A. S.]

[Ans. $100\,\sigma_A/M_A = 123.8$, $100\,\sigma_B/M_B = 108.9$.
∴ the team B is more consistent.]

5. The scores of 2 golfers for 24 rounds were as follows:

A. 74, 75, 78, 78, 72, 77, 79, 78, 81, 76, 72, 72, 77, 74, 70, 78, 79, 80, 81, 74, 80, 75, 71, 73.

B. 86, 84, 80, 88, 89, 85, 86, 82, 82, 79, 86, 80, 82, 76, 86, 89, 87, 83, 80, 88, 86, 81, 84, 87.

Find which golfer shows greater variability.

[Ans. $100\,\sigma_A/M_A = 4.26$, $100\,\sigma_B/M_B = 4.07$. Player A shows greater variability.]

6. From the following table compute the quartile deviation as well as the coefficient of skewness.

   Size    Frequency       Size    Frequency
   4-8         6          24-28       12
   8-12       10          28-32       10
  12-16       18          32-36        6
  16-20       30          36-40        2
  20-24       15

[Ans. Q = 5.3, coefficient of skewness = 0.208]

7. The means of two samples of sizes 50 and 100 respectively are 54.1 and 50.3 and the standard deviations are 8 and 7. Find the mean and standard deviation of the sample of size 150 obtained by combining the two samples.

[Ans. M = 51.57, σ = 7.56]

8. If a range of six times the standard deviation covers at least 18 class intervals, Sheppard's correction will make a difference of less than 0.5 percent in the uncorrected value of the standard deviation.

[Hint. If $\sigma_1$, $\sigma_2$ are the corrected and uncorrected standard deviations and h the length of the class interval, then

\[ 6\sigma_1 > 18h, \quad\text{i.e. } h < \frac{\sigma_1}{3}. \]

Now

\[ \sigma_2^2 - \sigma_1^2 = \frac{h^2}{12}, \]

so that

\[ \sigma_2^2 < \sigma_1^2 + \frac{\sigma_1^2}{108} = \sigma_1^2\Big(1+\frac{1}{108}\Big) \]

or

\[ \sigma_2 < \sigma_1\Big(1+\frac{1}{108}\Big)^{1/2} = \sigma_1\Big(1+\frac{1}{216}\Big) \text{ approx.} \]

\[ \therefore\quad \frac{\sigma_2-\sigma_1}{\sigma_1}\times 100 < \frac{100}{216} = 0.5 \text{ approx.]} \]



CHAPTER IV 


CONSISTENCE OF DATA AND ASSOCIATION OF 

ATTRIBUTES 

4.1. Attributes. The dictionary meaning of attribute is quality or property. In the theory of attributes the objects are classified according to quality, e.g. tall and short, black and white, healthy and sick etc. Thus all persons above a certain height are classified as tall and those below it as short. If a relation exists between two or more attributes, they are said to be associated. The association may be positive or negative, or the attributes may be independent.

4.2. Classification with reference to attributes. The presence of attributes is denoted by capital letters A, B, C, ... and their absence by α, β, γ, ... respectively. Thus if we represent white complexion by A, black complexion shall be denoted by α. The total frequency is denoted by N, and a combination of attributes is represented by grouping the letters representing the attributes. Thus if A stands for white complexion and B for tallness, AB shall represent tall persons with white complexion, αβ black persons with short stature, and so on. (A), (B), (Aβ) etc. represent the numbers possessing these attributes.

4.3. Class frequencies. A class specified by r attributes is known as a class of the rth order. Thus N, the total frequency, is a class frequency of order 0; (A), (B) etc. are class frequencies of the first order; (AB), (Aβ) etc. of the second order; (ABC), (AβC), ... of the third order, and so on. In all, if there are k attributes, the number of class frequencies of the rth order is ${}^kC_r\cdot 2^r$, since r attributes can be selected out of k in ${}^kC_r$ ways and each attribute can be either positive or negative, e.g. A or α. Thus the total number of class frequencies is

\[ \sum_{r=0}^{k} {}^kC_r\cdot 2^r = (1+2)^k = 3^k. \]    (I. A. S. ’55)

4.4. Relation between class frequencies. It can be seen that all the class frequencies are not independent. Thus

    (A) + (α) = N
    (AB) + (Aβ) = (A)
    (ABC) + (ABγ) = (AB)
    (A) = (AB) + (Aβ)
        = (ABC) + (ABγ) + (AβC) + (Aβγ).

Thus each class frequency can be expressed in terms of frequencies of the highest order, known as ultimate class frequencies. The number of ultimate class frequencies for k attributes is $2^k$.

(I. A. S. ’55)

It may be noted that the class symbols can be treated as operators. Thus

    A.N = (A),

meaning that if N is dichotomised according to A, we get (A). Similarly

    αβ.N = (αβ),
    αBC.N = (αBC).

Since A.N = (A) and α.N = (α), we have

    (A + α).N = (A) + (α) = N,

giving

    A + α = 1

or

    α = 1 - A.

Thus

    (αβ) = (1 - A)(1 - B).N
         = (1 - A - B + AB).N
         = N - (A) - (B) + (AB)

and

    (αβγ) = αβγ.N = (1 - A)(1 - B)(1 - C).N
          = N - (A) - (B) - (C) + (AB) + (BC) + (CA) - (ABC).

4.5. Solved Examples.

1. Measurements were made on a thousand husbands and a thousand wives. If the measurements of husbands exceed the measurements of wives in 789 cases for one measurement, in 741 for another, and in 690 cases for both measurements, in how many cases will both measurements on the wife exceed the measurements on the husband?

Let A denote the attribute that the husband's measurement exceeds the wife's in the first measurement, and B the same attribute for the second measurement.

We have

    N = 1000, (A) = 789, (B) = 741, (AB) = 690,

and we are required to find (αβ).

    (αβ) = N - (A) - (B) + (AB)
         = 1000 - 789 - 741 + 690 = 160.
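
The relations of § 4.4 for two attributes can be put into a few lines of Python. The function name `ultimate_frequencies` and the dictionary keys are illustrative choices (with "a" and "b" standing for α and β); the figures are those of the example above.

```python
def ultimate_frequencies(n, a, b, ab):
    """Ultimate class frequencies for two attributes, from
    N, (A), (B) and (AB), using alpha = 1 - A and beta = 1 - B."""
    return {
        "AB": ab,
        "Ab": a - ab,              # (A beta)     = (A) - (AB)
        "aB": b - ab,              # (alpha B)    = (B) - (AB)
        "ab": n - a - b + ab,      # (alpha beta) = N - (A) - (B) + (AB)
    }

freq = ultimate_frequencies(1000, 789, 741, 690)
print(freq)
print("total:", sum(freq.values()))
```

The four ultimate class frequencies add up to N, which serves as a check on the working.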



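2. Show that if A occurs in a larger proportion of cases where B is than where B is not, then B will occur in a larger proportion of cases where A is than where A is not. (Agra M. Sc. ’48, ’50)

Given that

\[ \frac{(AB)}{(B)} > \frac{(A\beta)}{(\beta)}, \]

we have to show that

\[ \frac{(AB)}{(A)} > \frac{(\alpha B)}{(\alpha)}. \]

Now

\[ \frac{(AB)}{(B)} > \frac{(A\beta)}{(\beta)} \]

if

\[ \frac{(B)}{(AB)} < \frac{(\beta)}{(A\beta)}, \]

i.e. if

\[ \frac{(\alpha B)}{(AB)} < \frac{(\alpha\beta)}{(A\beta)}, \quad\text{subtracting 1 from each side,} \]

i.e. if

\[ \frac{(A\beta)}{(AB)} < \frac{(\alpha\beta)}{(\alpha B)}, \]

i.e. if

\[ \frac{(A)}{(AB)} < \frac{(\alpha)}{(\alpha B)}, \quad\text{adding 1 to each side,} \]

i.e. if

\[ \frac{(AB)}{(A)} > \frac{(\alpha B)}{(\alpha)}. \]

Each of these steps is reversible, so the given inequality and the required one are equivalent.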

3. In a free vote in the House of Commons, 600 members voted. 300 Government members representing English constituencies (including Welsh) voted in favour of the motion. 25 opposition members representing Scottish constituencies voted against the motion. The Government majority among those who voted was 96. 135 of the members voting represented Scottish constituencies. 18 Government members voted against the motion. 102 Scottish members voted in favour of the motion. The motion was carried by 310 votes. Analyse the voting according to the nationality of constituencies and party.

Denote the Government members by A and opposition members by α, those voting for and against the motion by B and β, and those representing English and Scottish constituencies by C and γ respectively. We have the following frequencies:

N = 600, (ABC) = 300, (αβγ) = 25, (A) − (α) = 96, (γ) = 135, (Aβ) = 18,
(Bγ) = 102, (B) − (β) = 310.

Now N = (A) + (α) = 600
and (A) − (α) = 96,
giving (A) = 348, (α) = 252.

Similarly from (B) − (β) = 310
and (B) + (β) = 600, we get
(B) = 455,
(β) = 145.




Also (C) = N − (γ)
= 600 − 135 = 465.
This completes all the first order frequencies.

Now (AB) = (A) − (Aβ)
= 348 − 18 = 330
and (BC) = (B) − (Bγ)
= 455 − 102 = 353.

Also (αβγ) = (βγ) − (Aβγ)
= (γ) − (Bγ) − {(A) − (AC) − (AB) + (ABC)}
or 25 = 135 − 102 − {348 − (AC) − 330 + 300},
giving (AC) = 310.

Now (ABγ) = (AB) − (ABC) = 30.
Similarly (αBC) = 53,
(AβC) = 10,
(βγ) = (γ) − (Bγ) = 33;
so that (Aβγ) = (βγ) − (αβγ) = 8,
(αβ) = (β) − (Aβ) = 127
and hence (αβC) = 102 and (αBγ) = 72.
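As a check on the analysis, the eight ultimate class frequencies can be computed from the positive class frequencies found in the solution. This is only a sketch: the function name `ultimate_frequencies` and the dictionary keys are illustrative choices, and the inputs are the values (A) = 348, (B) = 455, (C) = 465, (AB) = 330, (AC) = 310, (BC) = 353, (ABC) = 300 obtained above.

```python
def ultimate_frequencies(n, a, b, c, ab, ac, bc, abc):
    """Expand the positive class frequencies of three attributes
    into the eight ultimate class frequencies."""
    return {
        "ABC": abc,
        "ABγ": ab - abc,
        "AβC": ac - abc,
        "αBC": bc - abc,
        "Aβγ": a - ab - ac + abc,
        "αBγ": b - ab - bc + abc,
        "αβC": c - ac - bc + abc,
        "αβγ": n - a - b - c + ab + ac + bc - abc,
    }


votes = ultimate_frequencies(600, 348, 455, 465, 330, 310, 353, 300)
for name, value in votes.items():
    print(name, value)
```

The eight values add up to N = 600, which is a quick way to catch a slip in any single step.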

4. At a competitive examination at which 600 graduates appeared, boys outnumbered girls by 96. Those qualifying for interview exceeded in number those failing to qualify by 310. The number of Science graduate boys interviewed was 300, while among Arts graduate girls there were 25 who failed to qualify for interview. Altogether there were only 135 Arts graduates and 33 among them failed to qualify. Boys who failed to qualify numbered 18. Find (a) the number of boys who qualified for interview, (b) the total number of Science graduate boys appearing and (c) the number of Science graduate girls who qualified. (I. A. S. ’53, U. P. P. C. S. ’56)

Let A stand for boys, α for girls; B for those qualified for interview and β for those not qualified; C for Science and γ for Arts candidates. We have the following equations from the data:

N = 600, ...(i)
(A) − (α) = 96, ...(ii)
(B) − (β) = 310, ...(iii)
(ABC) = 300, ...(iv)
(αβγ) = 25, ...(v)
(γ) = 135, ...(vi)
(βγ) = 33, ...(vii)
(Aβ) = 18. ...(viii)





We have to calculate (AB), (AC) and (αBC).

Since N = (A) + (α) = (B) + (β),
solving (i) and (ii), and (i) and (iii), we get
(A) = 348, (α) = 252;
(B) = 455, (β) = 145;
(AB) = (A) − (Aβ) = 348 − 18 = 330.

Now (AC) = (ABC) + (AβC)
= (ABC) + (Aβ) − (Aβγ)
= (ABC) + (Aβ) − {(βγ) − (αβγ)}
= 300 + 18 − {33 − 25}
= 310.

Also (αBC) = (BC) − (ABC)
= (C) − (βC) − (ABC)
= (C) − (β) + (βγ) − (ABC)
= N − (γ) − (β) + (βγ) − (ABC)
= 600 − 135 − 145 + 33 − 300 = 53.

5. Show that for n attributes A, B, C, ..., M
(ABC...M) ≥ (A) + (B) + (C) + ... + (M) − (n − 1)N,
where N is the total frequency.
(Agra M. Sc. ’49, ’55, ’57; P. C. S. ’58)

Since none of the class frequencies is negative,
(αβ) ≥ 0
or N − (A) − (B) + (AB) ≥ 0
or (AB) ≥ (A) + (B) − N.

Writing BC in place of B, we get
(ABC) ≥ (A) + (BC) − N
≥ (A) + (B) + (C) − N − N,
since (BC) ≥ (B) + (C) − N.
Hence (ABC) ≥ (A) + (B) + (C) − 2N.

Now, by the method of mathematical induction, let the formula be true for r attributes up to K, so that
(ABC...K) ≥ (A) + (B) + (C) + ... + (K) − (r − 1)N.

Writing KL for K, we get
(ABC...KL) ≥ (A) + (B) + (C) + ... + (KL) − (r − 1)N
≥ (A) + (B) + (C) + ... + (K) + (L) − N − (r − 1)N
= (A) + (B) + (C) + ... + (K) + (L) − rN.

Hence if the formula is true for r attributes, it is true for (r+1)




attributes. We have already proved it to be true for two and three attributes; hence it is true for 4, 5, ..., n attributes.

4·6. Consistence of Data. The observed class frequencies taken for the same universe should be consistent, so that no observation conflicts with any other. For consistence, it is necessary that no class frequency should be negative. In fact, since all frequencies can be expressed in terms of ultimate frequencies, it is sufficient for consistence that all ultimate class frequencies are non-negative.

(i) For one attribute, we have
(a) (A) ≮ 0,
(b) (A) ≯ N, otherwise (α) < 0.

(ii) For two attributes, we have
(a) (AB) ≮ 0, otherwise (AB) is negative;
(b) (AB) ≮ (A) + (B) − N, otherwise (αβ) is negative;
(c) (AB) ≯ (A), otherwise (Aβ) is negative;
(d) (AB) ≯ (B), otherwise (αB) is negative.

It may also be noted that in case of two attributes the conditions
(A) ≮ 0, (B) ≮ 0,
(A) ≯ N, (B) ≯ N
necessarily apply.

(iii) For three attributes, we have the following inequalities; otherwise the frequency given on the right will be negative:

(a) (ABC) ≮ 0                                               (ABC)
(b) (ABC) ≮ (AB) + (AC) − (A)                               (Aβγ)
(c) (ABC) ≮ (AB) + (BC) − (B)                               (αBγ)
(d) (ABC) ≮ (AC) + (BC) − (C)                               (αβC)
(e) (ABC) ≯ (AB)                                            (ABγ)
(f) (ABC) ≯ (AC)                                            (AβC)
(g) (ABC) ≯ (BC)                                            (αBC)
(h) (ABC) ≯ (AB) + (BC) + (AC) − (A) − (B) − (C) + N        (αβγ)

Since all these comparisons are not independent, we get the following four conditions: (i) is obtained by combining (a) and (h), (j) by combining (b) and (g), and so on. Thus

(i) (AB) + (AC) + (BC) ≮ (A) + (B) + (C) − N
(j) (AB) + (AC) − (BC) ≯ (A)
(k) (AB) − (AC) + (BC) ≯ (B)
(l) (AC) + (BC) − (AB) ≯ (C)





It may be noted that 2^n independent class frequencies are necessary to get complete information about the other class frequencies. Thus given the set of positive class frequencies, all other class frequencies can be found, since their number is 2^n and they are mutually independent.

In case the data supplied are not complete, so that it may not be possible to find the values of all the class frequencies, it is possible to find the limits within which the class frequencies can take values.

4·7. Solved Examples.

1. If a report gives the following frequencies as actually observed, show that there must be a misprint or mistake of some sort, and possibly the misprint consists in the dropping of 1 before 85 given as the frequency (BC).

N = 1000, (A) = 510, (B) = 490, (C) = 427, (AB) = 189,
(AC) = 140, (BC) = 85. (I. A. S. ’49, Agra M. Sc. ’58)

From condition (i) of § 4·6, we have
(BC) ≮ 510 + 490 + 427 − 1000 − 189 − 140
≮ 98.

But in the data given (BC) = 85, which is less than 98. If (BC) is read as 185, the data become consistent.
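The four three-attribute conditions (i) to (l) of § 4·6 can be tested mechanically. The sketch below is an illustration only; the function name `three_attribute_violations` is hypothetical, and it checks just these four conditions, not the first and second order ones.

```python
def three_attribute_violations(n, a, b, c, ab, ac, bc):
    """Return the labels of the conditions (i)-(l) of the section on
    consistence that the given frequencies fail to satisfy."""
    checks = {
        "(i)": ab + ac + bc >= a + b + c - n,
        "(j)": ab + ac - bc <= a,
        "(k)": ab - ac + bc <= b,
        "(l)": ac + bc - ab <= c,
    }
    return [label for label, ok in checks.items() if not ok]


# Frequencies as reported, then with (BC) read as 185.
print(three_attribute_violations(1000, 510, 490, 427, 189, 140, 85))
print(three_attribute_violations(1000, 510, 490, 427, 189, 140, 185))
```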


2. Given that (A) = (B) = (C) = ½N and that (AB)/N = (AC)/N = p, find what must be the greatest and least values of p in order that we may infer that (BC)/N exceeds any given value, say q.

From equation (i) of § 4·6, we have
(BC)/N ≮ 3/2 − 1 − 2p,
i.e. (BC)/N ≮ ½ − 2p.
We may therefore infer that (BC)/N is not less than q if
½ − 2p ≮ q
or 2p ≯ ½ − q
or p ≯ ¼(1 − 2q).

Similarly from equation (j) of § 4·6, we have
2p − (BC)/N ≯ ½
or 2p − ½ ≯ (BC)/N,
so that the inference also holds if
2p − ½ ≮ q
or p ≮ ¼(1 + 2q).


3. In a very hotly fought battle, 70% at least of the combatants lost an eye, 75% at least an ear, 80% at least a leg and 85% at least an arm. What percentage at least lost all four?
(Allahabad M. Com. ’49)

If we denote the losses of an eye, an ear, a leg and an arm by A, B, C, D respectively, we have
(A) ≥ ·70N, (B) ≥ ·75N, (C) ≥ ·80N, (D) ≥ ·85N.

Now from Example 5 of § 4·5,
(ABCD) ≥ (A) + (B) + (C) + (D) − 3N
or (ABCD)/N ≥ ·7 + ·75 + ·8 + ·85 − 3,
i.e. (ABCD)/N ≥ ·10.

Hence at least 10% lost all four.
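The bound of Example 5 of § 4·5 is easy to apply to any number of attributes. In the sketch below the function name `joint_lower_bound_percent` is an assumed, illustrative name; percentages are kept as whole numbers so that no rounding enters.

```python
def joint_lower_bound_percent(percentages):
    """Least percentage possessing all the attributes together,
    from (AB...M) >= (A) + (B) + ... + (M) - (n - 1) N."""
    n = len(percentages)
    return max(0, sum(percentages) - 100 * (n - 1))


# Eye, ear, leg and arm: 70, 75, 80 and 85 per cent at least.
print(joint_lower_bound_percent([70, 75, 80, 85]))
```

When the bound works out negative the function returns 0, since a class frequency cannot be negative and the inequality then gives no information.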

4. Show that if (AB)_1, (αB)_1, (Aβ)_1, (αβ)_1 and (AB)_2, (αB)_2, (Aβ)_2, (αβ)_2 be two aggregates corresponding to the same values of (A), (B), (α), (β), then

(AB)_1 − (AB)_2 = (αB)_2 − (αB)_1 = (Aβ)_2 − (Aβ)_1 = (αβ)_1 − (αβ)_2.

Now (A) = (AB)_1 + (Aβ)_1,
(A) = (AB)_2 + (Aβ)_2.

Subtracting, 0 = (AB)_1 − (AB)_2 + (Aβ)_1 − (Aβ)_2,
giving (AB)_1 − (AB)_2 = (Aβ)_2 − (Aβ)_1, etc.

5. The following summary appears in a report on a survey covering 1000 fields. Scrutinize the numbers and point out if there be any mistake or misprint in them:

Manured                                                 510
Irrigated fields                                        490
Fields growing improved varieties                       427
Fields both irrigated and manured                       189
Fields both manured and growing improved varieties      140
Fields both irrigated and growing improved varieties     85

(I. A. S. ’49)

If we denote manured fields by A, irrigated by B and growing improved varieties by C, we have

(A) = 510, (B) = 490, (C) = 427, (AB) = 189, (AC) = 140, (BC) = 85,
N = 1000.





Now for consistence
(AB) + (AC) + (BC) ≮ (A) + (B) + (C) − N
or 189 + 140 + 85 ≮ 510 + 490 + 427 − 1000
or 414 ≮ 427,
which is incorrect. Hence the given data are inconsistent.


6. In a village actually involved by anthrax, 70% of the goats were attacked and 85% have been inoculated with vaccine. What is the lowest percentage of the inoculated goats that must have been attacked? (I. A. S. ’55)

Denoting the attribute of attacked goats by A and inoculated by B, we have
(A) = ·7N, (B) = ·85N.

Since (αβ) = N − (A) − (B) + (AB) ≥ 0,
we have (AB) ≥ (A) + (B) − N,
i.e. (AB) ≥ (·7 + ·85)N − N
or (AB) ≥ ·55N.

Hence (AB)/(B) ≥ ·55/·85 = ·647 nearly, so that at least 65% (approximately) of the inoculated goats must have been attacked by anthrax.


7. A social survey in a village revealed that there were more 
uneducated males than educated ones, there were more educated 
employed males than uneducated unemployed males. There were 
more educated unemployed under 35 years of age than employed 
uneducated males over 35 years of age. Show that there are more 
uneducated employed males under 35 years of age than educated 
unemployed males over 35 years of age. (B. Sc. Agra ’60) 

Denote the attributes as follows:

A for educated males, B for employed and C for those under 35 years of age.

We have
(α) > (A), ...(1)
(AB) > (αβ) ...(2)
and (AβC) > (αBγ). ...(3)

Adding (1) and (2), we get
(α) + (AB) > (A) + (αβ)
or (α) − (αβ) > (A) − (AB)
or (αB) > (Aβ). ...(4)





Adding (3) and (4) and transposing, we get
(αB) − (αBγ) > (Aβ) − (AβC)
or (αBC) > (Aβγ),
which was to be proved.


8. 50 percent of the imports of barley into a country come from Dominions; 80 percent of the total imports go to brewing; 75 percent of the imports are grown in the Northern Hemisphere; 80 percent of the Northern-grown barley goes to brewing; 100 percent of foreign Southern-grown barley goes to stock-feeding. Show that the foreign Northern-grown barley which goes to brewing cannot be less than 30 percent nor more than 60 percent of the total imports. (It is assumed that brewing and stock-feeding are the only two uses to which imported barley is put.) (M. Sc. Agra ’54)

Denoting the total imports by N, barley from Dominions by A, barley going to brewing by B and barley grown in the Northern Hemisphere by C, we have

(A) = ·5N, ...(i)
(B) = ·8N, ...(ii)
(C) = ·75N, ...(iii)
·8 (C) = (BC), ...(iv)
(αβγ) = (αγ). ...(v)

To show that
·3N ≤ (αBC) ≤ ·6N.

From (iii) and (iv),
(BC) = ·6N.
Also (αBC) ≤ (BC),
i.e. (αBC) ≤ ·6N. ...(vi)

From (v), we have
(αγ) − (αβγ) = (αBγ) = 0.
Now (αBC) + (αBγ) = (αB),
so that (αBC) = (αB). ...(vii)

Now (Aβ) ≥ 0
or (1 − α)(1 − B)N ≥ 0
or N − (α) − (B) + (αB) ≥ 0,
giving (αB) ≥ (α) + (B) − N
or (αB) ≥ (·5 + ·8)N − N
≥ ·3N. ...(viii)

Equations (vi), (vii) and (viii) give
·3N ≤ (αBC) ≤ ·6N.





9. Show that if (A)/N = x, (B)/N = 2x, (C)/N = 3x
and (AB)/N = (AC)/N = (BC)/N = y,
then the value of neither x nor y can exceed ¼. (B. A. Punjab ’63)

We have (BC) ≮ (B) + (C) − N
or y ≮ 5x − 1.
Also (AB) ≯ (A) or y ≯ x.
Hence x ≮ 5x − 1
or 1 ≮ 4x
or x ≯ ¼.
Hence y ≯ ¼, since y ≯ x.


10. Given that (A) = (B) = (C) = ½N and 80 percent of the A's are B's, 75 percent of the A's are C's, find the limits to the percentage of B's that are C's.

Given (AB)/(A) = ·8
or (AB) = ·4N, since (A) = ½N,
and (AC)/(A) = ·75
or (AC) = ·375N.

From equations (i), (j), (k), (l) of § 4·6, we have
(a) 2(BC)/N ≮ 1 − ·8 − ·75,
(b) 2(BC)/N ≮ ·8 + ·75 − 1,
(c) 2(BC)/N ≯ 1 − ·8 + ·75,
(d) 2(BC)/N ≯ 1 + ·8 − ·75.

From (b) and (c), we have
(BC)/(N/2) ≮ ·55, (BC)/(N/2) ≯ ·95,
i.e. not less than 55% and not more than 95% of the B's are C's.


11. 100 children took three examinations. 40 passed the first, 39 passed the second and 48 passed the third; 10 passed all three, 9 passed the first two and failed in the third, 19 failed the first two and passed the third. Find how many children passed at least two examinations. Show that for the question asked certain of the given frequencies are not necessary. Which are they? (P. C. S. ’52)





Let A, B and C denote passing the first, second and third examinations respectively. We have

(A) = 40, (B) = 39, (C) = 48, (ABC) = 10, (ABγ) = 9,
(αβC) = 19, N = 100.

We have to find the number of children who passed at least two examinations, i.e. (ABC) + (ABγ) + (AβC) + (αBC), of which we know (ABC) and (ABγ).

Now (C) = (AC) + (αC)
= (ABC) + (AβC) + (αBC) + (αβC)
or 48 = (ABC) + (AβC) + (αBC) + 19
or (ABC) + (AβC) + (αBC) + (ABγ) = 48 − 19 + 9 = 38.

It is to be noted that we require only (C), (αβC) and (ABγ); the other frequencies are not necessary.

Again N − (αβC) − (αβγ) = (A) + (B) − (ABC) − (ABγ) is the linear relation existing among the given frequencies. Hence all the eight frequencies are not independent here.

4·8. Explain the terms independence and association as applied to attributes. (B. Sc. Agra ’61, M. Sc. Agra ’53)

When are two attributes said to be associated?
(B. A. Vikram ’69)


If two attributes A and B are independent, we would expect the same proportion of A's among B's as among β's. Thus for independence

(AB)/(B) = (Aβ)/(β) = {(AB) + (Aβ)}/{(B) + (β)} = (A)/N
or (AB)/(B) = (A)/N
or (AB) = (A)(B)/N
and (AB)/N = (A)/N . (B)/N.

If the attributes A and B are independent, the proportion of AB's in the population is equal to the product of the proportions of A's and of B's in the population.

If (AB) > (A)(B)/N,
then A and B are positively associated, while if
(AB) < (A)(B)/N,
A and B are said to be negatively associated. If A and B are positively associated, they would appear together in a larger number of cases than if they had been independent.





4·9. Yule's coefficient of association. In order to measure the intensity of association between two attributes, G. Udny Yule gave a simple coefficient of association,

Q = {(AB)(αβ) − (Aβ)(αB)} / {(AB)(αβ) + (Aβ)(αB)}
= Nδ / {(AB)(αβ) + (Aβ)(αB)},

where δ = (AB) − (A)(B)/N.

For independence, we have
(AB)/(B) = (Aβ)/(β)
or (AB)/{(B) − (AB)} = (Aβ)/{(β) − (Aβ)}
or (AB)/(αB) = (Aβ)/(αβ)
or (AB)(αβ) = (Aβ)(αB).

Also δ = (AB) − (A)(B)/N
= (1/N) [(AB){(AB) + (αB) + (Aβ) + (αβ)} − {(AB) + (Aβ)}{(AB) + (αB)}]
= (1/N) {(AB)(αβ) − (αB)(Aβ)}.

(M. Sc. Agra ’52, B. Sc. Agra ’55)

If the attributes A and B are independent, Q = 0. If Q > 0 the attributes are said to be positively associated, while a negative value of Q indicates negative association. If Q = 1, there is complete association between the attributes and if Q = −1, the attributes are completely disassociated. Since all the frequencies in Q are non-negative,

−1 ≤ Q ≤ 1.

If δ = (AB) − (A)(B)/N,
then −δ = (αB) − (α)(B)/N,
−δ = (Aβ) − (A)(β)/N
and δ = (αβ) − (α)(β)/N.

We see that
(αB) − (α)(B)/N = (B) − (AB) − {N − (A)}(B)/N
= −{(AB) − (A)(B)/N} = −δ.

Similarly we can prove that (Aβ) − (A)(β)/N = −δ.

Also (αβ) − (α)(β)/N = (α) − (αB) − (α){N − (B)}/N
= −(αB) + (α)(B)/N = δ.

4·10. Solved Examples.

1. Show that if δ = (AB) − (A)(B)/N, then
(AB)^2 + (αβ)^2 − (αB)^2 − (Aβ)^2 = [(A) − (α)] [(B) − (β)] + 2Nδ.
(M. Sc. Agra ’60)

L.H.S. = (AB)^2 − (αB)^2 + (αβ)^2 − (Aβ)^2
= {(AB) + (αB)}{(AB) − (αB)} + {(αβ) + (Aβ)}{(αβ) − (Aβ)}
= (B){(AB) − (αB)} + (β){(αβ) − (Aβ)}
= (B){δ + (A)(B)/N − (α)(B)/N + δ} + (β){δ + (α)(β)/N − (A)(β)/N + δ}
= 2δ{(B) + (β)} + {(A)/N}{(B)^2 − (β)^2} − {(α)/N}{(B)^2 − (β)^2}
= 2Nδ + [{(A) − (α)}{(B) − (β)}],
since (B) + (β) = N.

2. Criticize the following statements:

(a) Nearly all A's are B's and therefore A and B must be associated.

(b) [...] the members who voted for army estimates were military men; therefore it is safe to suppose that the voting [...]. (M. Sc. Agra ’53)

(c) [...] the people who drink beer die before reaching 100 years of age. Therefore drinking beer is bad for longevity.
(I. A. S. ’48; M. Sc. Agra ’59, ’61; U. P. P. C. S. ’57)

(a) The inference is not justified. In order to ascertain whether A and B are associated we should also know (αB)/(α), i.e. the proportion of B's among α's. It is quite possible that while all A's are B's, all α's may also be B's.

(b) Unless we know the composition of the whole population or of those members who voted against army estimates, the inference is not correct. It is possible that those who voted against the army estimates may be of the reverse type.

(c) For reasons given in (a) and (b), in this case too the inference is not correct. It is possible that 100% of the persons who do not drink beer may die before reaching 100 years of age, in which case drinking may be found to be good for longevity.

Therefore for association between A and B, in addition to (AB)/(A) it is necessary to know (αB)/(α) or (B)/N.

3. State as briefly as possible whether the attributes A and B are positively associated, negatively associated or independent in the following cases:

(i) N = 1000, (A) = 470, (B) = 620, (AB) = 320.
(ii) (A) = 490, (AB) = 294, (α) = 570, (αB) = 380.
(iii) (AB) = 256, (αB) = 768, (Aβ) = 48, (αβ) = 144.
(U. P. P. C. S. ’61)

(i) (A)(B)/N = 470 × 620/1000 = 291·4.
In this case (AB) = 320 > (A)(B)/N. Hence A and B are positively associated.

(ii) Proportion of B's in A's = (AB)/(A) = 294/490 = 3/5.
Proportion of B's in α's = (αB)/(α) = 380/570 = 2/3.
Here (AB)/(A) < (αB)/(α); hence A and B are negatively associated.

(iii) (AB)(αβ) − (Aβ)(αB) = 256 × 144 − 48 × 768
= 0.
A and B are independent since Q is zero in this case.
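The three cases can also be settled by the single test of § 4·8, namely the sign of (AB) − (A)(B)/N. The sketch below is illustrative only: the function name `association` is hypothetical, and for cases (ii) and (iii) the totals N, (A) and (B) are first built up from the frequencies given.

```python
from fractions import Fraction


def association(n, a, b, ab):
    """Classify the association of A and B from the sign of
    delta = (AB) - (A)(B)/N, using exact rational arithmetic."""
    delta = Fraction(ab) - Fraction(a * b, n)
    if delta > 0:
        return "positive"
    if delta < 0:
        return "negative"
    return "independent"


print(association(1000, 470, 620, 320))                 # case (i)
print(association(490 + 570, 490, 294 + 380, 294))      # case (ii)
print(association(256 + 768 + 48 + 144, 256 + 48, 256 + 768, 256))  # case (iii)
```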

4. Given that
(A) = (α) = (B) = (β) = (C) = (γ) = ½N
and also (ABC) = (αβγ),
show that
2 (ABC) = (AB) + (AC) + (BC) − ½N.

(αβγ) = N − (A) − (B) − (C) + (AB) + (AC) + (BC) − (ABC)
or (ABC) = N − (3/2)N + (AB) + (AC) + (BC) − (ABC),
giving 2 (ABC) = (AB) + (AC) + (BC) − ½N.

5. In a war between red and white forces, there are more red soldiers than white, there are more armed white soldiers than unarmed reds, there are fewer armed reds with ammunition than unarmed whites without ammunition. Show that there are more armed reds without ammunition than unarmed whites with ammunition. (Agra B. Sc. 1961)

white A      armed B        with ammunition C
red α        unarmed β      without ammunition γ

With the above notation, we get the following inequalities:

(α) > (A), ...(i)
(AB) > (αβ), ...(ii)
(αBC) < (Aβγ). ...(iii)

We have to show that (αBγ) > (AβC).

From (i), (αB) + (αβ) > (AB) + (Aβ)
> (Aβ) + (αβ) from (ii).
Hence (αB) > (Aβ)
or (αBC) + (αBγ) > (AβC) + (Aβγ)
> (AβC) + (αBC) from (iii),
giving (αBγ) > (AβC), the required result.

6. An investigation was carried out to determine whether there was any association between the eye-colour of parents and the eye-colour of children. The eye-colours were noted in the case of a random sample of 1000 fathers and their eldest sons. In 471 cases both fathers and sons had light eyes, in 230 cases both had dark eyes, in 148 cases the fathers were dark-eyed and the sons light-eyed, and in all the remaining cases the sons were dark-eyed and the fathers light-eyed. Determine whether eye-colour in fathers and in sons is associated or independent.
(Agra B. Sc. ’63, M. A. ’44)

Let A represent fathers with light eyes,
    B     ,,    sons    ,,   light eyes,
    α     ,,    fathers ,,   dark eyes,
    β     ,,    sons    ,,   dark eyes.

Then we have the association table

         B       β
A       471     151
α       148     230

Coefficient of association
Q = {(AB)(αβ) − (αB)(Aβ)} / {(AB)(αβ) + (αB)(Aβ)}
= (471 × 230 − 148 × 151) / (471 × 230 + 148 × 151)
= (108330 − 22348) / (108330 + 22348)
= 85982/130678
= ·66 nearly.

Since Q is positive, we conclude that there is an association between the eye-colours of fathers and their sons.
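The coefficient can be recomputed directly from the four cell frequencies of the association table. This is a minimal sketch; `yules_q` and its argument names are illustrative labels, not notation from the text.

```python
def yules_q(ab, a_beta, alpha_b, alpha_beta):
    """Yule's coefficient of association from the four ultimate
    class frequencies (AB), (A beta), (alpha B), (alpha beta)."""
    agree = ab * alpha_beta          # (AB)(alpha beta)
    disagree = a_beta * alpha_b      # (A beta)(alpha B)
    return (agree - disagree) / (agree + disagree)


# Fathers and sons: (AB) = 471, (A beta) = 151, (alpha B) = 148, (alpha beta) = 230.
q = yules_q(471, 151, 148, 230)
print(round(q, 3))
```

The function divides by (AB)(αβ) + (Aβ)(αB), so it is undefined for a table in which both products vanish.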

EXERCISES 

1. If, in an urban district, 817 per thousand of the women between 20 and 25 years of age were returned as “occupied” at a census, and 263 per thousand as married or widowed, what is the lowest proportion per thousand of the married or widowed that must have been occupied?
[Ans. 304 per thousand]

2. The following are the proportions per 1000 of girls observed for certain classes of defects amongst a number of school-children:
A = development defects, B = nerve signs, C = mental dullness;
N = 1000, (A) = 68, (B) = 85, (C) = 69, (AB) = 55, (BC) = 36.
Show that some defectively developed girls are dull and state how many at least must be so. [Ans. 6 girls]

3. A market investigator returns the following data:
Of 1000 people consulted, 811 liked chocolates, 752 liked toffee and 418 liked boiled sweets; 570 liked chocolates and toffee, 356 liked chocolates and boiled sweets, and 348 liked toffee and boiled sweets; 297 liked all three. Show that this information as it stands must be incorrect.
(Agra M. Sc. 1956)

4. In an anti-malarial campaign in a certain area, quinine was administered to 812 persons out of a total of 3248. The number of fever cases is shown below:

Treatment       Fever     No fever
Quinine           20         792
No quinine       220        2216

Discuss the usefulness of quinine in checking malaria.
(Quinine is effective in checking malaria.)


5. The male population in U. P. is 250 lakhs. The number of literate males is 20 lakhs and the total number of male criminals is 26 thousands. The number of literate male criminals is 2 thousands. Do you find any association between literacy and criminality? (Agra M. Sc. 1951)

(Literacy and criminality are negatively associated, since 2 thousands is less than 20 × 26/250 = 2·08 thousands.)

6. A penny is tossed three times and the results, heads and tails, noted. The process is continued until there are 100 sets of threes. In 69 cases heads fell first, in 49 cases heads fell second and in 21 cases heads fell third. In 33 cases heads fell both first and second, and in 21 cases heads fell both second and third. Show that there must have been at least five occasions on which heads fell three times and that there could not have been more than 15 occasions on which tails fell three times, though there need not have been any.



CHAPTER V 


FINITE DIFFERENCES AND INTERPOLATION 


5·1. Suppose we have a function y = u_x where x can take the values a, a+h, a+2h, a+3h, ...; the corresponding values of y will be
u_a, u_{a+h}, u_{a+2h}, u_{a+3h}, ...

In the tabular form we write

x        y           First          Second          Third          Fourth
                     differences    differences     differences    differences

a        u_a
                     Δu_a
a+h      u_{a+h}                    Δ^2 u_a
                     Δu_{a+h}                       Δ^3 u_a
a+2h     u_{a+2h}                   Δ^2 u_{a+h}                    Δ^4 u_a
                     Δu_{a+2h}                      Δ^3 u_{a+h}
a+3h     u_{a+3h}                   Δ^2 u_{a+2h}
                     Δu_{a+3h}
a+4h     u_{a+4h}

where Δu_a = u_{a+h} − u_a, Δu_{a+h} = u_{a+2h} − u_{a+h}, etc.

In general Δu_{a+(n−1)h} = u_{a+nh} − u_{a+(n−1)h}.

Δ^2 u_a = Δu_{a+h} − Δu_a = (u_{a+2h} − u_{a+h}) − (u_{a+h} − u_a) = u_{a+2h} − 2u_{a+h} + u_a,
Δ^2 u_{a+h} = Δu_{a+2h} − Δu_{a+h} = (u_{a+3h} − u_{a+2h}) − (u_{a+2h} − u_{a+h})
= u_{a+3h} − 2u_{a+2h} + u_{a+h}
and so on.

Similarly Δ^3 u_a = Δ^2 u_{a+h} − Δ^2 u_a,
Δ^3 u_{a+h} = Δ^2 u_{a+2h} − Δ^2 u_{a+h}
and Δ^4 u_a = Δ^3 u_{a+h} − Δ^3 u_a.
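The construction of the table can be sketched in a few lines of code. The function name `difference_table` is an assumed, illustrative name; it takes the entries at equally spaced arguments and returns the entry column followed by each column of differences.

```python
def difference_table(entries):
    """Return [y, first differences, second differences, ...]
    for entries given at equally spaced arguments."""
    columns = [list(entries)]
    while len(columns[-1]) > 1:
        previous = columns[-1]
        # Each difference is the lower entry minus the upper one.
        columns.append([previous[i + 1] - previous[i]
                        for i in range(len(previous) - 1)])
    return columns


# Squares of 0, 1, 2, 3, 4 with interval of differencing h = 1.
for order, column in enumerate(difference_table([0, 1, 4, 9, 16])):
    print(order, column)
```

The first element of each returned column is the corresponding leading term or leading difference.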


5·2. Some Nomenclatures. The independent variable x is known as the argument, the corresponding value of y as the entry, and a table of this form as a difference table. The first term in the entry column, i.e. u_a, is called the leading term and the terms at the top of the difference columns, Δu_a, Δ^2 u_a, Δ^3 u_a, ..., are the leading differences.

Caution. It may be noted that Δ is not a quantity but represents an operation, and Δ^2 does not represent the square of Δ but the operation of differencing having been done twice. Thus




Δ^2 u_a = Δu_{a+h} − Δu_a
= (u_{a+2h} − u_{a+h}) − (u_{a+h} − u_a).

Similarly Δ^3 means that the difference operation has been done thrice, and so on.

Example 1. Let us consider a numerical example of the case of cubes of natural numbers.

x      y       Δy      Δ^2 y    Δ^3 y

1      1
               7
2      8               12
               19                6
3      27              18
               37                6
4      64              24
               61                6
5      125             30
               91
6      216

Any further value of y can be found with the help of the difference table by extending the columns further. Thus if we wish to find 7^3 with the help of this table, the last terms under the columns Δ^3, Δ^2, Δ and y respectively will be

6, 36, 127, 343,

giving 7^3 = 343.

Example 2. Find the sixth term of the series
8, 12, 19, 29, 42, ...

We have

x      y       Δy      Δ^2 y

1      8
               4
2      12              3
               7
3      19              3
               10
4      29              3
               13
5      42

The new terms in the columns Δ^2, Δ and y are 3, 16, 58 respectively.

Hence the sixth term is 58.
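Extending the columns amounts to adding the last entry of every column of the table, on the assumption that the highest difference computed stays constant. The sketch below illustrates this; the function name `next_term` is hypothetical.

```python
def next_term(entries):
    """Extend a difference table by one row, assuming the highest
    order difference that can be formed remains constant."""
    total = 0
    column = list(entries)
    while column:
        total += column[-1]          # last entry of this column
        column = [column[i + 1] - column[i]
                  for i in range(len(column) - 1)]
    return total


print(next_term([8, 12, 19, 29, 42]))        # sixth term of the series
print(next_term([1, 8, 27, 64, 125, 216]))   # cube of 7
```

The result is only as good as the assumption of a constant highest difference, which holds exactly when the entries come from a polynomial of sufficiently low degree.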





5·3. E and Δ notation. Let us denote u_{a+h} by Eu_a, u_{a+2h} by E^2 u_a and in general u_{a+nh} by E^n u_a, keeping in mind the precaution that E is a symbol of operation and the superscripts over E do not represent power indices. We have

Δu_a = u_{a+h} − u_a
= Eu_a − u_a
or Eu_a = Δu_a + u_a.

For brevity, and to represent a relation between these operators, we write the above results as

Δ = E − 1
or E = Δ + 1.

It may be noted that
Δ{f(x)} = E{f(x)} − f(x)
= f(x+h) − f(x)
and not E{f(x)} − 1. Also the relation between the symbols as above is an identity, but when we introduce functions in them, they are written as equations with the sign of equality (=) between them.

5·4. Solved Examples.

1. If u_x is a polynomial of degree n in x, then Δ^n u_x is a constant and Δ^{n+1} u_x is zero. Conversely, if the (n+1)th difference is zero, then the polynomial is not of more than degree n.
(B. Sc. Agra ’59)

Let u_x = ax^n + bx^{n−1} + cx^{n−2} + ... + lx + m,
Δu_x = a{(x+h)^n − x^n} + b{(x+h)^{n−1} − x^{n−1}} + ...
+ l{(x+h) − x} + m − m
= anh x^{n−1} + b′x^{n−2} + c′x^{n−3} + ... + k′.

Similarly Δ^2 u_x = an(n−1)h^2 x^{n−2} + b″x^{n−3} + ... + k″
and similarly Δ^n u_x = an(n−1)(n−2)...2.1.h^n = a.n!.h^n
= constant.

Evidently Δ^{n+1} u_x = 0.

The converse can be proved in a similar manner.

Note. If n is a positive integer,
E^n u_x = u_{x+nh} = (1 + Δ)^n u_x
= u_x + ^nC_1.Δu_x + ^nC_2.Δ^2 u_x + ... + Δ^n u_x
and Δ^n u_x = (E − 1)^n u_x = E^n u_x − ^nC_1.E^{n−1}u_x + ^nC_2.E^{n−2}u_x − ...
+ (−1)^r.^nC_r.E^{n−r}u_x + ... + (−1)^n u_x
= u_{x+nh} − ^nC_1 u_{x+(n−1)h} + ^nC_2 u_{x+(n−2)h} − ... + (−1)^n u_x.
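The expansion of Δ^n u_x in terms of the entries lends itself to a direct numerical check. The following is a sketch only; the function name `nth_difference` and the choice of passing u as a Python function are illustrative assumptions.

```python
from math import comb


def nth_difference(u, x, n, h=1):
    """Delta^n u_x computed as the alternating binomial sum
    u_{x+nh} - C(n,1) u_{x+(n-1)h} + ... + (-1)^n u_x."""
    return sum((-1) ** r * comb(n, r) * u(x + (n - r) * h)
               for r in range(n + 1))


def cube(t):
    return t ** 3


print(nth_difference(cube, 1, 3))        # a . n! . h^n with a = 1, n = 3, h = 1
print(nth_difference(cube, 0, 3, h=2))   # same with h = 2
print(nth_difference(cube, 5, 4))        # one order higher than the degree
```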

2. u_x is a polynomial in x, the following values of which are known: u_2 = u_3 = 27; u_4 = 78; u_5 = 169. Find the function u_x.

x      u_x     Δ       Δ^2     Δ^3

2      27
               0
3      27              51
               51              −11
4      78              40
               91
5      169

With four values given, only one third difference can be formed; taking Δ^3 to be constant, u_x is a cubic function.

Let u_x = ax^3 + bx^2 + cx + d,
so that 27 = a.2^3 + b.2^2 + c.2 + d,
27 = a.3^3 + b.3^2 + c.3 + d,
78 = a.4^3 + b.4^2 + c.4 + d,
169 = a.5^3 + b.5^2 + c.5 + d.

Solving these equations, we get
a = −11/6, b = 42, c = −1051/6, d = 224,
so that u_x = (1/6)(−11x^3 + 252x^2 − 1051x + 1344).

3. Find the nth difference of e^x.

We have Δe^x = e^{x+h} − e^x = e^x (e^h − 1),
Δ^2 e^x = (e^h − 1)(e^{x+h} − e^x) = (e^h − 1)^2 e^x
and similarly, Δ^n e^x = (e^h − 1)^n e^x.

4. Evaluate:
(i) (Δ^2/E) x^3, (ii) Δ^2 x^3 / Ex^3. (B. Sc. Agra ’59)

(i) (Δ^2/E) x^3 = Δ^2 E^{−1} x^3 = Δ^2 (x − h)^3
= Δ[Δ(x − h)^3] = Δ[x^3 − (x − h)^3]
= Δ[3x^2 h − 3xh^2 + h^3]
= 3h{(x + h)^2 − x^2} − 3h^2 (x + h − x)
= 3h(2xh + h^2) − 3h^3
= 6xh^2.

(ii) Δ^2 x^3 / Ex^3 = Δ[Δx^3] / (x + h)^3 = Δ[(x + h)^3 − x^3] / (x + h)^3
= Δ[3x^2 h + 3xh^2 + h^3] / (x + h)^3
= [3h{(x + h)^2 − x^2} + 3h^2 (x + h − x)] / (x + h)^3
= (6xh^2 + 6h^3) / (x + h)^3 = 6h^2 / (x + h)^2,
where the interval of differencing is h.






5. Find the first difference of x^2 − 5x + 6, the interval of differencing being 1. (B. Sc. Agra ’55)

u_x = x^2 − 5x + 6,
Δu_x = (x + 1)^2 − x^2 − {5(x + 1) − 5x}
= 2x − 4.

6. Show that u_4 = u_0 + 4Δu_0 + 6Δ^2 u_{−1} + 10Δ^3 u_{−1} as far as third differences.

u_4 = E^4 u_0 = (1 + Δ)^4 u_0
= (1 + 4Δ + 6Δ^2 + 4Δ^3) u_0 up to third differences
= u_0 + 4Δu_0 + 6Δ^2 Eu_{−1} + 4Δ^3 Eu_{−1}
= u_0 + 4Δu_0 + 6Δ^2 (1 + Δ) u_{−1} + 4Δ^3 (1 + Δ) u_{−1}
= u_0 + 4Δu_0 + 6Δ^2 u_{−1} + 10Δ^3 u_{−1} up to third differences.

7. Evaluate

(i) Δ^3 [(1 − x)(1 − 2x)(1 − 3x)]. (B. A. Punjab ’54)

(ii) Δ^10 [(1 − ax)(1 − bx^2)(1 − cx^3)(1 − dx^4)].
(B. A. Punjab ’61)

(iii) Δ^2 [(a^{2x} + a^{4x}) / (a^2 − 1)^2].

(i) The given expression is a polynomial of the third degree. Hence its third difference, according to Ex. 1 of § 5·4, is a.(3 !) if a is the coefficient of the highest degree term, the interval of differencing being unity.
∴ Δ^3 u_x = −6.(3 !) = −36.

(ii) From what has been said in (i), we have
Δ^10 u_x = abcd.(10 !).

(iii) Δa^{2x} = a^{2(x+1)} − a^{2x}, the interval of differencing being 1,
= a^{2x} (a^2 − 1).
In general Δ^n a^{mx} = a^{mx} (a^m − 1)^n.

Hence Δ^2 [(a^{2x} + a^{4x}) / (a^2 − 1)^2]
= {1/(a^2 − 1)^2} [a^{2x} (a^2 − 1)^2 + a^{4x} (a^4 − 1)^2]
= a^{2x} + a^{4x} (a^2 + 1)^2.


8. If u_x = sin x, show that Δ^2 u_x = kEu_x where k is a constant.

We know Δ = E − 1.
Hence Δ^2 u_x = (E − 1)^2 u_x
= E^2 u_x − 2Eu_x + u_x
= sin (x + 2h) − 2 sin (x + h) + sin x
= 2 sin (x + h) cos h − 2 sin (x + h)
= k sin (x + h), where k = 2(cos h − 1),
= kEu_x.




9. Prove that Δ(tan^{−1} x) = tan^{−1} [h/(1 + hx + x^2)], where h is the interval of differencing. (B. A. Punjab ’57)

Δ tan^{−1} x = tan^{−1} (x + h) − tan^{−1} x
= tan^{−1} [{(x + h) − x}/{1 + (x + h).x}]
= tan^{−1} [h/(1 + xh + x^2)].

10. Obtain the function whose first difference is
x^3 + 3x^2 + 5x + 12,
the interval of differencing being unity.

Clearly the function is a polynomial of the fourth degree. Let it be
Ax^4 + Bx^3 + Cx^2 + Dx + E;
then A{(x + 1)^4 − x^4} + B{(x + 1)^3 − x^3} + C{(x + 1)^2 − x^2} + D{(x + 1) − x}
= x^3 + 3x^2 + 5x + 12.

Simplifying and comparing the coefficients on both sides,
4A = 1, 6A + 3B = 3, 4A + 3B + 2C = 5, A + B + C + D = 12,
so that A = 1/4, B = 1/2, C = 5/4, D = 10.

Hence the required function is
(1/4)x^4 + (1/2)x^3 + (5/4)x^2 + 10x + E,
where E is a constant.
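A result of this kind is easily verified by differencing the answer. The sketch below takes the coefficients found by comparing powers of x, namely A = 1/4, B = 1/2, C = 5/4, D = 10 (the constant E drops out on differencing), and checks f(x+1) − f(x) against the given first difference at several integer arguments. The names `candidate` and `first_difference_target` are illustrative.

```python
from fractions import Fraction


def candidate(x):
    """(1/4)x^4 + (1/2)x^3 + (5/4)x^2 + 10x, the arbitrary constant omitted."""
    x = Fraction(x)
    return x ** 4 / 4 + x ** 3 / 2 + Fraction(5, 4) * x ** 2 + 10 * x


def first_difference_target(x):
    return x ** 3 + 3 * x ** 2 + 5 * x + 12


# Two quartics agreeing at more than four points agree identically,
# so a check at eleven integers is conclusive here.
print(all(candidate(x + 1) - candidate(x) == first_difference_target(x)
          for x in range(-5, 6)))
```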

11. Show that
u_0 + u_1 + u_2 + ... + u_n = ^{n+1}C_1 u_0 + ^{n+1}C_2 Δu_0 + ^{n+1}C_3 Δ^2 u_0 + ... + Δ^n u_0.
(B. A. Punjab ’53)

The L. H. S. is
u_0 + u_1 + u_2 + ... + u_n = u_0 + Eu_0 + E^2 u_0 + ... + E^n u_0
= (1 + E + E^2 + ... + E^n) u_0
= {(E^{n+1} − 1)/(E − 1)} u_0
= [{(1 + Δ)^{n+1} − 1}/Δ] u_0
= (1/Δ) {1 + ^{n+1}C_1 Δ + ^{n+1}C_2 Δ^2 + ... + Δ^{n+1} − 1} u_0
= R. H. S.
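The identity can be tried on any run of numbers. In the sketch below the helper names `leading_differences` and `sum_by_differences` are hypothetical; the first returns u_0, Δu_0, Δ^2 u_0, ... and the second forms the right-hand side of the identity.

```python
from math import comb


def leading_differences(entries):
    """Return [u_0, Delta u_0, Delta^2 u_0, ...] for the given entries."""
    leads = []
    column = list(entries)
    while column:
        leads.append(column[0])
        column = [column[i + 1] - column[i]
                  for i in range(len(column) - 1)]
    return leads


def sum_by_differences(entries):
    """C(n+1,1) u_0 + C(n+1,2) Delta u_0 + ... + Delta^n u_0."""
    n = len(entries) - 1
    leads = leading_differences(entries)
    return sum(comb(n + 1, k + 1) * leads[k] for k in range(n + 1))


values = [1, 8, 27, 64, 125]
print(sum(values), sum_by_differences(values))
```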

12. Use the method of separation of symbols to prove the following identities:

(i) u_1 x + u_2 x^2 + u_3 x^3 + ...
= {x/(1 − x)} u_1 + {x^2/(1 − x)^2} Δu_1 + {x^3/(1 − x)^3} Δ^2 u_1 + ...
(Agra M. Sc. ’59)

(ii) u_x = u_{x−1} + Δu_{x−2} + Δ^2 u_{x−3} + ... + Δ^{n−1} u_{x−n} + Δ^n u_{x−n}.

(iii) u_0 + u_1 x/1! + u_2 x^2/2! + u_3 x^3/3! + ...
= e^x [u_0 + xΔu_0 + (x^2/2!) Δ^2 u_0 + ...].
(Agra M. Sc. ’55, B. A. Vikram ’60)

(iv) Δ^n u_{x−n} = u_x − ^nC_1 u_{x−1} + ^nC_2 u_{x−2} − ...

(v) u_x − u_{x+1} + u_{x+2} − u_{x+3} + ...
= ½ [u_{x−1/2} − (1/8) Δ^2 u_{x−3/2} + (1.3/2!) (1/8)^2 Δ^4 u_{x−5/2} − (1.3.5/3!) (1/8)^3 Δ^6 u_{x−7/2} + ...].

(vi) u_{x+n} = u_n + ^xC_1 Δu_{n−1} + ^{x+1}C_2 Δ^2 u_{n−2} + ^{x+2}C_3 Δ^3 u_{n−3} + ...


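(i) The L. H. S. = u_1 x + u_2 x^2 + u_3 x^3 + ...
= Eu_0.x + E^2 u_0.x^2 + E^3 u_0.x^3 + ...
= (xE + x^2 E^2 + x^3 E^3 + ...) u_0
= {xE/(1 − xE)} u_0
= {x/(1 − xE)} u_1
= [x/{1 − x(1 + Δ)}] u_1
= {x/(1 − x)} {1 − xΔ/(1 − x)}^{−1} u_1
= {x/(1 − x)} {1 + xΔ/(1 − x) + x^2 Δ^2/(1 − x)^2 + ...} u_1
= {x/(1 − x)} u_1 + {x^2/(1 − x)^2} Δu_1 + {x^3/(1 − x)^3} Δ^2 u_1 + ...

(ii) The R. H. S. = u_{x−1} + Δu_{x−2} + Δ^2 u_{x−3} + ... + Δ^{n−1} u_{x−n} + Δ^n u_{x−n}
= (E^{−1} + ΔE^{−2} + Δ^2 E^{−3} + ... + Δ^{n−1} E^{−n}) u_x + Δ^n E^{−n} u_x
= [E^{−1} (1 − Δ^n E^{−n})/(1 − ΔE^{−1})] u_x + Δ^n E^{−n} u_x
= {1/(E − Δ)} (1 − Δ^n E^{−n}) u_x + Δ^n E^{−n} u_x
= u_x, since E − Δ = 1.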




(iii) The L. H. S. = u_0 + u_1 x/1! + u_2 x^2/2! + u_3 x^3/3! + ...
= u_0 + xEu_0 + (x^2/2!) E^2 u_0 + ...
= (1 + xE + x^2 E^2/2! + ...) u_0
= e^{xE} u_0 = e^{x(1 + Δ)} u_0
= e^x.e^{xΔ} u_0
= e^x [1 + xΔ + x^2 Δ^2/2! + x^3 Δ^3/3! + ...] u_0
= e^x [u_0 + xΔu_0 + (x^2/2!) Δ^2 u_0 + (x^3/3!) Δ^3 u_0 + ...].

(iv) The R. H. S.
= u_x − ^nC_1 u_{x−1} + ^nC_2 u_{x−2} − ...
= u_x − ^nC_1 E^{−1} u_x + ^nC_2 E^{−2} u_x − ...
= (1 − ^nC_1 E^{−1} + ^nC_2 E^{−2} − ...) u_x
= (1 − E^{−1})^n u_x
= {(E − 1)/E}^n u_x
= (E − 1)^n E^{−n} u_x = Δ^n u_{x−n}.

(v) The R. H. S.
= ½ [E^{−1/2} − (1/8) Δ^2 E^{−3/2} + (1.3/2!) (1/8)^2 Δ^4 E^{−5/2} − (1.3.5/3!) (1/8)^3 Δ^6 E^{−7/2} + ...] u_x
= ½ E^{−1/2} [1 − ½ (Δ^2 E^{−1}/4) + (1.3/2.4) (Δ^2 E^{−1}/4)^2 − (1.3.5/2.4.6) (Δ^2 E^{−1}/4)^3 + ...] u_x
= ½ E^{−1/2} [1 + Δ^2/(4E)]^{−1/2} u_x
= ½ E^{−1/2} [{4E + (E − 1)^2}/(4E)]^{−1/2} u_x
= [(E + 1)^2]^{−1/2} u_x
= (1 + E)^{−1} u_x = (1 − E + E^2 − E^3 + ...) u_x
= u_x − u_{x+1} + u_{x+2} − u_{x+3} + ...

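(vi) The R. H. S.

$$= u_n + x \Delta E^{-1} u_n + \frac{x(x+1)}{2!} \Delta^2 E^{-2} u_n + \frac{x(x+1)(x+2)}{3!} \Delta^3 E^{-3} u_n + \ldots$$
$$= \left(1 + x \Delta E^{-1} + \frac{x(x+1)}{2!} \Delta^2 E^{-2} + \ldots\right) u_n$$
$$= (1 - \Delta E^{-1})^{-x} u_n$$
$$= \left(\frac{E - \Delta}{E}\right)^{-x} u_n$$
$= E^x u_n = u_{n+x}$, since $E - \Delta = 1$.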

13. Use the method of finite differences to sum the following
series:

(i) $1^3 + 2^3 + 3^3 + \ldots + n^3$.

(ii) $x^2 + \frac{1}{2}(x+1)^2 + \frac{1}{2^2}(x+2)^2 + \frac{1}{2^3}(x+3)^2 + \ldots$

(Punjab ’52)

(iii) $1^4 + 2^4 + 3^4 + \ldots + n^4$.

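(i) If we denote $1^3$ by $u_0$, $2^3$ by $u_1$ and so on, the given
series is

$$u_0 + u_1 + u_2 + \ldots + u_{n-1}$$
$$= (1 + E + E^2 + \ldots + E^{n-1}) u_0$$
$$= \frac{E^n - 1}{E - 1} u_0 = \frac{(1 + \Delta)^n - 1}{\Delta} u_0$$
$$= n u_0 + \frac{n(n-1)}{2!} \Delta u_0 + \frac{n(n-1)(n-2)}{3!} \Delta^2 u_0 + \frac{n(n-1)(n-2)(n-3)}{4!} \Delta^3 u_0$$
$$= n + \frac{n(n-1)}{2!} \times 7 + \frac{n(n-1)(n-2)}{3!} \times 12 + \frac{n(n-1)(n-2)(n-3)}{4!} \times 6$$
$$= \frac{n^2 (n+1)^2}{4}$$

[since $\Delta u_0 = 2^3 - 1^3 = 7$; $\Delta^2 u_0 = 3^3 - 2 \cdot 2^3 + 1^3 = 12$;
$\Delta^3 u_0 = 4^3 - 3 \cdot 3^3 + 3 \cdot 2^3 - 1^3 = 6$,
the higher differences being 0].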

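(ii) If we denote $(x+n)^2$ by $u_n$, we have to sum :

$$u_0 + \frac{1}{2} u_1 + \frac{1}{2^2} u_2 + \frac{1}{2^3} u_3 + \ldots$$
$$= \left(1 + \frac{1}{2} E + \frac{1}{2^2} E^2 + \frac{1}{2^3} E^3 + \ldots\right) u_0$$
$$= \frac{1}{1 - \frac{1}{2} E} u_0$$
$$= \frac{2}{2 - E} u_0 = \frac{2}{1 - \Delta} u_0$$
$$= 2 (1 - \Delta)^{-1} u_0$$
$$= 2 (1 + \Delta + \Delta^2 + \ldots) u_0$$
$$= 2 [x^2 + \Delta x^2 + \Delta^2 x^2]$$
$$= 2 [x^2 + 2x + 1 + 2]$$
$$= 2 (x^2 + 2x + 3).$$

(iii) Proceed as in (i).

Ans. $\frac{n}{30} (6n^4 + 15n^3 + 10n^2 - 1)$.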


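14. Find the value of

$$\Delta x^m - \frac{1}{2} \Delta^2 x^m + \frac{1 \cdot 3}{2 \cdot 4} \Delta^3 x^m - \frac{1 \cdot 3 \cdot 5}{2 \cdot 4 \cdot 6} \Delta^4 x^m + \ldots \text{ to } m \text{ terms.}$$

(Vikram ’61)

Since $\Delta^{m+1} x^m$ and higher differences of $x^m$ are zero, the sum
of the series is the same as the sum to infinity.

The given expression

$$= \Delta \left(1 - \frac{1}{2} \Delta + \frac{1 \cdot 3}{2 \cdot 4} \Delta^2 - \frac{1 \cdot 3 \cdot 5}{2 \cdot 4 \cdot 6} \Delta^3 + \ldots\right) x^m$$
$$= \Delta (1 + \Delta)^{-1/2} x^m$$
$$= \Delta E^{-1/2} x^m$$
$$= \Delta \left(x - \tfrac{1}{2}\right)^m$$
$$= \left(x + \tfrac{1}{2}\right)^m - \left(x - \tfrac{1}{2}\right)^m,$$

where the interval of differencing is unity.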

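15. Prove that $u_4 = u_3 + \Delta u_2 + \Delta^2 u_1 + \Delta^3 u_1$. (B. A. Punjab ’61)

We have, $u_4 - u_3 = \Delta u_3 = \Delta (u_2 + \Delta u_2)$

$$= \Delta u_2 + \Delta^2 u_2 = \Delta u_2 + \Delta^2 (u_1 + \Delta u_1)$$
$$= \Delta u_2 + \Delta^2 u_1 + \Delta^3 u_1.$$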

16. Given $u_0 = 3$, $u_1 = 12$, $u_2 = 81$, $u_3 = 200$, $u_4 = 100$, $u_5 = 8$,
find $\Delta^5 u_0$. (Agra B. Sc. ’59)

$\Delta = E - 1$.

$$\Delta^5 u_0 = (E - 1)^5 u_0$$
$$= E^5 u_0 - 5E^4 u_0 + 10E^3 u_0 - 10E^2 u_0 + 5E u_0 - u_0$$
$$= u_5 - 5u_4 + 10u_3 - 10u_2 + 5u_1 - u_0.$$

On substituting the given values, we have

$$\Delta^5 u_0 = 8 - 500 + 2000 - 810 + 60 - 3 = 755.$$
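
The expansion of $(E - 1)^5$ used above can be checked numerically. The following is a minimal sketch in Python, not part of the original text; the function names `forward_difference` and `difference_table` are illustrative, and the data are the values of Problem 16 read as $u_4 = 100$ and $u_5 = 8$.

```python
from math import comb

def forward_difference(values, order):
    """Leading difference of the given order, from the expansion of (E - 1)^order."""
    return sum((-1) ** (order - k) * comb(order, k) * values[k]
               for k in range(order + 1))

def difference_table(values):
    """Successive difference columns obtained by repeated subtraction."""
    table = [list(values)]
    while len(table[-1]) > 1:
        prev = table[-1]
        table.append([b - a for a, b in zip(prev, prev[1:])])
    return table

u = [3, 12, 81, 200, 100, 8]          # u_0 ... u_5 (assumed reading of the data)
delta5 = forward_difference(u, 5)     # binomial expansion
table = difference_table(u)           # table[5][0] is the same quantity
print(delta5, table[5][0])
```

Both routes should agree, since the difference table is only the binomial expansion carried out one subtraction at a time.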

17. Find $u_6$, given $u_0 = -3$, $u_1 = 6$, $u_2 = 8$, $u_3 = 12$, third differences
being constant.

Since the third differences are constant, the fourth and
subsequent differences are all zero.

$$u_6 = E^6 u_0 = (1 + \Delta)^6 u_0$$
$$= u_0 + 6 \Delta u_0 + 15 \Delta^2 u_0 + 20 \Delta^3 u_0$$
$$= u_0 + 6 (u_1 - u_0) + 15 (u_2 - 2u_1 + u_0) + 20 (u_3 - 3u_2 + 3u_1 - u_0)$$
$$= -3 + (6 \times 9) + 15 (-7) + 20 (12 - 24 + 18 + 3)$$
$$= 126.$$

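5.5. Factorial Notation. For certain purposes, we define
$x^{(n)}$ as the product of $n$ factors beginning with $x$ and decreasing
successively by a constant difference $h$ :

$$x^{(n)} = x (x - h) (x - 2h) \ldots (x - \overline{n-1}\, h),$$
$$\Delta x^{(n)} = (x + h)\, x\, (x - h) (x - 2h) \ldots (x - \overline{n-2}\, h) - x (x - h) (x - 2h) \ldots (x - \overline{n-1}\, h)$$
$$= x (x - h) (x - 2h) \ldots (x - \overline{n-2}\, h) \{(x + h) - (x - \overline{n-1}\, h)\}$$
$$= x (x - h) (x - 2h) \ldots (x - \overline{n-2}\, h) \cdot nh$$
$$= nh\, x^{(n-1)}. \quad \text{(Agra B. Sc. ’55)}$$

Similarly,

$$\Delta^2 x^{(n)} = n (n - 1) h^2 x^{(n-2)}, \quad \text{(Agra B. Sc. ’56)}$$

so that $\Delta^n x^{(n)} = n!\, h^n$.

If $h = 1$, then $x^{(n)} = (x - n + 1)\, x^{(n-1)}$, and putting $n = 0$,

$$x^{(0)} = (x + 1)\, x^{(-1)}.$$

By convention, $x^{(0)} = 1$, so that

$$x^{(-1)} = \frac{1}{x + 1}.$$

Putting $n = -1$, $x^{(-1)} = (x + 2)\, x^{(-2)}$, giving

$$x^{(-2)} = \frac{1}{(x + 1)(x + 2)}.$$

The argument can be extended to prove that

$$x^{(-n)} = \frac{1}{(x + 1)(x + 2) \ldots (x + n)} = \frac{1}{(x + n)^{(n)}}.$$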
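
As an illustration of the factorial notation, the rule $\Delta x^{(n)} = nh\, x^{(n-1)}$ and its analogue for negative index can be checked with exact fractions. This is a sketch under stated assumptions, not part of the original text: `factorial_power` and `delta` are hypothetical helper names, and negative indices are supported only for $h = 1$, as in the article.

```python
from fractions import Fraction

def factorial_power(x, n, h=1):
    """x^(n) = x (x - h) ... (x - (n-1) h) for n >= 0.
    For n = -m < 0 (unit interval only): 1 / ((x+1)(x+2)...(x+m))."""
    x = Fraction(x)
    if n >= 0:
        result = Fraction(1)
        for k in range(n):
            result *= x - k * h
        return result
    if h != 1:
        raise ValueError("negative index is defined here only for h = 1")
    result = Fraction(1)
    for k in range(1, -n + 1):
        result *= x + k
    return 1 / result

def delta(f, x, h=1):
    """First forward difference of f at x with interval h."""
    return f(x + h) - f(x)

# Delta x^(4) with h = 2 at x = 7, compared with n h x^(n-1)
lhs = delta(lambda t: factorial_power(t, 4, 2), 7, 2)
rhs = 4 * 2 * factorial_power(7, 3, 2)
print(lhs, rhs)
```

Exact `Fraction` arithmetic is used so that the reciprocal factorials compare without rounding.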


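5.6. Solved Examples.

1. Find the relation between $\alpha$, $\beta$, $\gamma$ in order that $\alpha + \beta x + \gamma x^2$
may be expressible in one term in the factorial notation.

(Agra B. Sc. ’61)

Let $\alpha + \beta x + \gamma x^2 = \gamma (x - h)(x - h - 1)$

$$= \gamma x^2 - \gamma x (2h + 1) + \gamma h (h + 1),$$

giving $\beta = -\gamma (1 + 2h)$,

$\alpha = \gamma h (h + 1)$.

Eliminating $h$ between these two equations, $h = -\frac{\beta + \gamma}{2\gamma}$,

so that $\alpha = \gamma \left(-\frac{\beta + \gamma}{2\gamma}\right) \left(\frac{\gamma - \beta}{2\gamma}\right) = \frac{\beta^2 - \gamma^2}{4\gamma}$,

giving $\gamma^2 + 4\alpha\gamma = \beta^2$.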

2. Define the functions $x^{(m)}$ and $x^{(-m)}$. Obtain their nth
differences.

We have already defined the functions and found the value of
$\Delta^n x^{(m)}$ in the last article.

Now $\Delta x^{(-m)} = (-m)\, x^{(-m-1)}$,

$\Delta^2 x^{(-m)} = (-m)(-m-1)\, x^{(-m-2)}$,

and generally $\Delta^n x^{(-m)} = (-1)^n m (m+1)(m+2) \ldots (m+n-1)\, x^{(-m-n)}$.

3. Represent the function $x^4 - 12x^3 + 42x^2 - 30x + 9$ and its
successive differences in factorial notation.

Let $x^4 - 12x^3 + 42x^2 - 30x + 9$

$= Ax (x-1)(x-2)(x-3) + Bx (x-1)(x-2) + Cx (x-1) + Dx + E$.

Putting $x = 0, 1, 2, 3$ successively on both sides, we get

$E = 9$, $D = 1$, $C = 13$, $B = -6$.

Also equating the coefficients of $x^4$ on both sides, we get
$A = 1$, so that the given expression is

$$u_x = x^{(4)} - 6x^{(3)} + 13x^{(2)} + x^{(1)} + 9.$$

$$\Delta u_x = 4x^{(3)} - 6 \cdot 3x^{(2)} + 13 \cdot 2x^{(1)} + 1 = 4x^{(3)} - 18x^{(2)} + 26x^{(1)} + 1,$$
$$\Delta^2 u_x = 4 \cdot 3x^{(2)} - 18 \cdot 2x^{(1)} + 26 = 12x^{(2)} - 36x^{(1)} + 26,$$
$$\Delta^3 u_x = 24x^{(1)} - 36,$$
$$\Delta^4 u_x = 24.$$
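
The coefficients in factorial notation can also be obtained by dividing the polynomial successively by $x$, $x - 1$, $x - 2$, ... and collecting the remainders. The sketch below is illustrative only and is not the method printed in the text, which substitutes $x = 0, 1, 2, 3$; the names `to_factorial_form` and `difference` are hypothetical, and a unit interval of differencing is assumed.

```python
def to_factorial_form(coeffs):
    """coeffs: ordinary coefficients, highest power first.
    Returns the coefficients of x^(n), ..., x^(1), x^(0)."""
    coeffs = list(coeffs)
    remainders = []
    k = 0
    while coeffs:
        # synthetic division of the current polynomial by (x - k)
        quotient, carry = [], 0
        for c in coeffs:
            carry = carry * k + c
            quotient.append(carry)
        remainders.append(quotient.pop())   # remainder = coefficient of x^(k)
        coeffs = quotient
        k += 1
    return remainders[::-1]

def difference(fact):
    """Difference of a polynomial given in factorial form: x^(m) -> m x^(m-1)."""
    n = len(fact) - 1
    return [(n - i) * c for i, c in enumerate(fact[:-1])]

u = to_factorial_form([1, -12, 42, -30, 9])
d1 = difference(u)
d2 = difference(d1)
print(u, d1, d2)
```

Differencing in this form is a termwise operation, which is the practical advantage of the factorial notation.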

4. Obtain the function whose first difference is

$$x^3 + 3x^2 + 5x + 12.$$

Expressing the given function in the factorial notation,

$$\Delta u_x = Ax (x-1)(x-2) + Bx (x-1) + Cx + D.$$

Putting $x = 0, 1, 2$ and comparing coefficients of $x^3$ on
both sides, we get

$D = 12$, $C = 9$, $B = 6$ and $A = 1$.

$$\Delta u_x = x^{(3)} + 6x^{(2)} + 9x^{(1)} + 12.$$

Since $\Delta x^{(m)} = m x^{(m-1)}$,

so that $x^{(m-1)} = \frac{1}{m} \Delta x^{(m)}$,

we get $u_x = \frac{1}{4} x^{(4)} + 2x^{(3)} + \frac{9}{2} x^{(2)} + 12x^{(1)} + E$

$$= \frac{1}{4} x (x-1)(x-2)(x-3) + 2x (x-1)(x-2) + \frac{9}{2} x (x-1) + 12x + E,$$

where $E$ is any constant.

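5. Given $u_0$, $u_1$, $u_2$, $u_3$, $u_4$, $u_5$ (fifth differences constant),
prove that

$$u_{2\frac{1}{2}} = \frac{1}{2} c + \frac{25 (c - b) + 3 (a - c)}{256},$$

(Agra B. Sc. ’61)

where $a = u_0 + u_5$, $b = u_1 + u_4$, $c = u_2 + u_3$.

Using the previous notations,

$$(1 + \Delta)^{5/2} u_0 = u_0 + \frac{5}{2} \Delta u_0 + \frac{\frac{5}{2} \cdot \frac{3}{2}}{2!} \Delta^2 u_0 + \frac{\frac{5}{2} \cdot \frac{3}{2} \cdot \frac{1}{2}}{3!} \Delta^3 u_0 + \frac{\frac{5}{2} \cdot \frac{3}{2} \cdot \frac{1}{2} \cdot \left(-\frac{1}{2}\right)}{4!} \Delta^4 u_0 + \frac{\frac{5}{2} \cdot \frac{3}{2} \cdot \frac{1}{2} \cdot \left(-\frac{1}{2}\right) \cdot \left(-\frac{3}{2}\right)}{5!} \Delta^5 u_0$$
$$= u_0 + \frac{5}{2} (u_1 - u_0) + \frac{15}{8} (u_2 - 2u_1 + u_0) + \frac{5}{16} (u_3 - 3u_2 + 3u_1 - u_0)$$
$$- \frac{5}{128} (u_4 - 4u_3 + 6u_2 - 4u_1 + u_0) + \frac{3}{256} (u_5 - 5u_4 + 10u_3 - 10u_2 + 5u_1 - u_0)$$
$$= \frac{3}{256} u_0 - \frac{25}{256} u_1 + \frac{75}{128} u_2 + \frac{75}{128} u_3 - \frac{25}{256} u_4 + \frac{3}{256} u_5$$
$$= \frac{3}{256} a - \frac{25}{256} b + \frac{75}{128} c$$
$$= \frac{1}{2} c + \frac{3a - 25b + 22c}{256}$$
$$= \frac{1}{2} c + \frac{3a - 25b + 25c - 3c}{256}$$
$$= \frac{1}{2} c + \frac{25 (c - b) + 3 (a - c)}{256}.$$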

6. If $p$, $q$, $r$ and $s$ be the successive entries corresponding
to equidistant arguments, show that when third differences are taken
into account, the entry corresponding to the argument half way between
the arguments of $q$ and $r$ is

$$\frac{q + r}{2} + \frac{(q + r) - (p + s)}{16}.$$

The difference table is as follows :

argument   entry     1st diff.        2nd diff.         3rd diff.
  (x)     ($u_x$)  ($\Delta u_x$)  ($\Delta^2 u_x$)  ($\Delta^3 u_x$)
   0        p
                     q - p
   1        q                        r - 2q + p
                     r - q                             s - 3r + 3q - p
   2        r                        s - 2r + q
                     s - r
   3        s

We are required to find the value of the entry corresponding
to the value of the argument midway between 1 and 2, i.e. for $\frac{3}{2}$,
so that $x = \frac{3}{2}$.

$$u_{3/2} = (1 + \Delta)^{3/2} u_0$$
$$= \left\{1 + x \Delta + \frac{x (x-1)}{2!} \Delta^2 + \frac{x (x-1)(x-2)}{3!} \Delta^3\right\} u_0$$
$$= p + \frac{3}{2} (q - p) + \frac{3}{8} (r - 2q + p) - \frac{1}{16} (s - 3r + 3q - p)$$
$$= p \left(1 - \frac{3}{2} + \frac{3}{8} + \frac{1}{16}\right) + q \left(\frac{3}{2} - \frac{3}{4} - \frac{3}{16}\right) + r \left(\frac{3}{8} + \frac{3}{16}\right) - \frac{1}{16} s$$
$$= -\frac{p}{16} + \frac{9q}{16} + \frac{9r}{16} - \frac{s}{16}$$
$$= \frac{q + r}{2} + \frac{(q + r) - (p + s)}{16}.$$


5.7. Define ‘Interpolation’. What are the underlying assumptions
for the validity of the various methods used for interpolation ?

(B. Sc. Agra ’61)

Sometimes the values of a function for certain values of the
independent variable are given and we are required to find the
value of the function for some value in the given interval. For
example, the census in India is taken every ten years (e.g. 1931,
1941, 1951, 1961 and so on) and for certain purposes it is required
to estimate the population in an intermediate year, say 1944; then
with the help of the available figures of the census years, the
population of 1944 can be approximately calculated. This process of
finding the value of the function is known as interpolation.

Interpolation is the operation of obtaining the value of a
function for any intermediate value of the argument, being given the
values of the function for certain values of the argument.

The word ‘interpolare’ in Latin means ‘to polish up’ as well as
‘to corrupt, to falsify’, so that great care and caution are necessary
in the use of the method of interpolation.

There are certain assumptions for the application of the
method of interpolation. Firstly, it is assumed that there are no
sudden jumps or falls in the values of the function during the
interval considered. Thus an outbreak of war, epidemic, famine,
large-scale immigration or emigration are some of the factors that
cause sudden changes in population and the method of interpolation
would fail in such cases. Moreover the rise and fall must be uniform.
In finite difference methods, the function must be capable of being
expressed as a rational integral function of $x$. Thus if the function
is not a polynomial in $x$, the series shall not terminate, since
none of the differences shall vanish, and the finite difference
formulae would not be exactly valid. If $f(x)$ is a polynomial in $x$ of
degree $n$, then $\Delta^{n+1} f(x)$ and higher order differences vanish and the
formulae for finite differences given hereafter stand valid.

5.8. Algebraic Methods of Interpolation. If $f(x)$ can be
expressed as a function of $x$ in the form

$$y = a + bx + cx^2 + \ldots,$$

the various given values of $x$ and $y$ giving the coefficients $a, b, c, \ldots$,
then the value of $y$ for any given value of $x$ can be obtained. If $n$
values of $(x, y)$ are given, we can fit a curve of degree $(n - 1)$.
Another method is to plot the given values on a graph paper and
draw a smooth graph through these points which would give the
value of $y$ for any value of $x$.

5.9. Newton’s Formula for Equal Intervals.

$$u_{x+nh} = u_x + {}^nC_1 \Delta u_x + {}^nC_2 \Delta^2 u_x + \ldots + {}^nC_r \Delta^r u_x + \ldots + \Delta^n u_x.$$

If $n$ is a positive integer, we prove this formula by the
method of Mathematical Induction. Let us assume it to be true
for a value $n$.

$$u_{x+(n+1)h} = (1 + \Delta)\, u_{x+nh}$$
$$= (u_x + {}^nC_1 \Delta u_x + {}^nC_2 \Delta^2 u_x + \ldots + {}^nC_r \Delta^r u_x + \ldots + \Delta^n u_x)$$
$$+ \Delta (u_x + {}^nC_1 \Delta u_x + {}^nC_2 \Delta^2 u_x + \ldots + {}^nC_r \Delta^r u_x + \ldots + \Delta^n u_x)$$
$$= u_x + ({}^nC_1 + 1) \Delta u_x + ({}^nC_2 + {}^nC_1) \Delta^2 u_x + \ldots + ({}^nC_r + {}^nC_{r-1}) \Delta^r u_x + \ldots + \Delta^{n+1} u_x$$
$$= u_x + {}^{n+1}C_1 \Delta u_x + {}^{n+1}C_2 \Delta^2 u_x + \ldots + {}^{n+1}C_r \Delta^r u_x + \ldots + \Delta^{n+1} u_x,$$

so that if the formula is true for a value $n$, it is also true for the
value $n + 1$. But we know that it is true for $n = 1, 2$ and hence
it is true for $n = 3, 4, 5, \ldots$ i.e. for all positive integral values of $n$.

In this formula if we replace $x$ by 0 and $h$ by 1 and then $n$
by $x$, we get

$$u_x = u_0 + {}^xC_1 \Delta u_0 + {}^xC_2 \Delta^2 u_0 + \ldots + {}^xC_r \Delta^r u_0 + \ldots + \Delta^x u_0.$$

The interpolation formula given above can be written as

$$u_x = u_0 + x \Delta u_0 + \frac{x (x-1)}{2!} \Delta^2 u_0 + \ldots + \frac{x (x-1) \ldots (x-r+1)}{r!} \Delta^r u_0 + \ldots$$

This has been proved for positive integral values of $x$. If, however, $x$
is negative or fractional, the formula is not necessarily true and will
hold only if $u_x$ is a polynomial in $x$ of degree $n$, as in this case
$\Delta^{n+1} u_x$ and differences of higher orders shall vanish, since it is only
in a polynomial that the differences of $(n+1)$th order and higher
orders necessarily vanish.

There are two types of problems that arise. Out of a series 
of equidistant values of a function, some may be missing and we 
are required to find them. For example, we are given $u_0$, $u_1$, $u_3$, $u_5$,
$u_7$ and the problem is to find $u_2$ and $u_6$.

The second is that $u_0$, $u_1$, $u_2$, $u_3$, $u_4$ are given and we may be
required to find $u_{1/2}$, $u_{3/4}$ or some other intermediate values. For
example, we may be given the census figures of 1921, 1931, 1941, 
1951, and the problem is to estimate the population of 1932 or any 
other intermediate year. 

5.10. Solved Examples.

One Intermediate Value Missing.

1. Estimate the annual rate of cloth sales for 1935 from the
following data :

Year    Sale of cloth in lakhs of yards
1920    250
1925    285
1930    328
1940    444

(M. A. Agra ’57)

Denoting the year by $x$, the sale by $u_x$ and the scale of year
interval being 5 years, we take the origin to be at 1920.

$u_0 = 250$, $u_1 = 285$, $u_2 = 328$, $u_3 = ?$, $u_4 = 444$.

Since four values are given, we assume $u_x$ to be a polynomial
of third degree so that $\Delta^4 u_x = 0$.

$$\Delta^4 u_0 = (E - 1)^4 u_0 = (E^4 - 4E^3 + 6E^2 - 4E + 1) u_0,$$
$$0 = u_4 - 4u_3 + 6u_2 - 4u_1 + u_0$$

or $4u_3 = u_4 + 6u_2 - 4u_1 + u_0$

$$= 444 + (6 \times 328) - (4 \times 285) + 250 = 1522.$$

$u_3 = 380.5$.

Hence the sale of cloth in 1935 will be approximately 380.5
lakhs of yards.
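
The same device works for any single missing term among equally spaced values. The following is a minimal sketch, not part of the original text: `fill_missing` is a hypothetical helper which assumes, as in the example, that the known values lie on a polynomial whose degree is one less than their number, so that the leading difference of order n vanishes.

```python
from fractions import Fraction
from math import comb

def fill_missing(values):
    """values: equally spaced terms u_0..u_n with exactly one entry None.
    Solves Delta^n u_0 = sum_k (-1)^(n-k) C(n,k) u_k = 0 for the missing term."""
    n = len(values) - 1
    missing = values.index(None)
    total = Fraction(0)
    for k, v in enumerate(values):
        if k != missing:
            total += (-1) ** (n - k) * comb(n, k) * Fraction(v)
    coeff = (-1) ** (n - missing) * comb(n, missing)
    return -total / coeff

cloth_1935 = fill_missing([250, 285, 328, None, 444])
print(float(cloth_1935))
```

Applied to the cloth data, the function reproduces the equation $4u_3 = 1522$ solved above.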

Two Intermediate Values Missing. 

2. Interpolate the missing figures in the following table of rice
cultivation :

Year    x    Acres in millions ($u_x$)
1911    0    76.6
1912    1    78.7
1913    2    ?
1914    3    77.7
1915    4    78.7
1916    5    ?
1917    6    80.6
1918    7    77.6
1919    8    78.7

(M. A. Agra ’50, B. Sc. Agra ’61)

Since seven values are given, we regard $u_x$ to be a polynomial
of degree six, so that $\Delta^7 u_x = 0$.

$$\Delta^7 u_0 = u_7 - {}^7C_1 u_6 + {}^7C_2 u_5 - {}^7C_3 u_4 + {}^7C_4 u_3 - {}^7C_5 u_2 + {}^7C_6 u_1 - u_0$$

or $0 = u_7 - 7u_6 + 21u_5 - 35u_4 + 35u_3 - 21u_2 + 7u_1 - u_0$,

$$0 = 77.6 - (7 \times 80.6) + 21u_5 - (35 \times 78.7) + (35 \times 77.7) - 21u_2 + (7 \times 78.7) - 76.6. \quad \ldots\text{(i)}$$

Also $\Delta^7 u_1 = 0$ so that

$$u_8 - 7u_7 + 21u_6 - 35u_5 + 35u_4 - 21u_3 + 7u_2 - u_1 = 0$$

or

$$0 = 78.7 - (7 \times 77.6) + (21 \times 80.6) - 35u_5 + (35 \times 78.7) - (21 \times 77.7) + 7u_2 - 78.7. \quad \ldots\text{(ii)}$$

Solving (i) and (ii), we get

$u_5 = 80.5$,

$u_2 = 78.2$,

so that the estimated acres of land under cultivation in 1913 and
1916 are 78.2 and 80.5 millions respectively.

3. Given sin 45° = 0.7071, sin 50° = 0.7660, sin 55° = 0.8192,
sin 60° = 0.8660, find sin 52°. (I. A. S. ’55, Punjab M. A. ’52)

Angle   x    $u_x$    $\Delta u_x$   $\Delta^2 u_x$   $\Delta^3 u_x$
45°     0    0.7071
                      0.0589
50°     1    0.7660                 -0.0057
                      0.0532                          -0.0007
55°     2    0.8192                 -0.0064
                      0.0468
60°     3    0.8660

For 52°, $x = \frac{52° - 45°}{5°} = 1.4$.

Now $u_x = (1 + \Delta)^x u_0$

$$= \left\{1 + x \Delta + \frac{x (x-1)}{2!} \Delta^2 + \frac{x (x-1)(x-2)}{3!} \Delta^3\right\} u_0$$
$$= 0.7071 + (1.4 \times 0.0589) + \frac{(1.4)(1.4 - 1)}{2!} (-0.0057) + \frac{(1.4)(1.4 - 1)(1.4 - 2)}{3!} (-0.0007)$$
$$= 0.7071 + 0.08246 - 0.001596 + 0.000039$$
$$= 0.7880.$$

Hence sin 52° = 0.7880 nearly.

4. By constructing a difference table find the 7th term as well
as the general term of the sequence

0, 0, 2, 6, 12, 20. (I. A. S. ’55)

x    $u_x$   $\Delta$   $\Delta^2$   $\Delta^3$   $\Delta^4$   $\Delta^5$
0     0
              0
1     0                  2
              2                       0
2     2                  2                         0
              4                       0                         0
3     6                  2                         0
              6                       0                         0
4    12                  2                         0
              8                       0
5    20                  2
             10
6    30

The seventh term can be obtained in this manner :
$\Delta^3 u_3 = 0$ and $\Delta^2 u_3 = 2$, so that $\Delta^2 u_4 = 2$, since $E = 1 + \Delta$;
and $\Delta u_4 = 8$ and $\Delta^2 u_4 = 2$, giving $\Delta u_5 = 10$;
and $u_5 = 20$, $\Delta u_5 = 10$, so that $u_6 = 30$.

$$u_r = E^r u_0 = (1 + \Delta)^r u_0$$

or $u_r = u_0 + r \Delta u_0 + \frac{r (r-1)}{2!} \Delta^2 u_0$

$$= 0 + r \cdot 0 + \frac{r (r-1)}{2!} \cdot 2$$
$$= r (r - 1),$$

the general term.





5. From the following table, estimate the number of persons
earning wages between 60 and 70 rupees :

Wages in rupees                  below 40   40-60   60-80   80-100   100-120
Number of persons in thousands     250       120     100      70        50

(Agra M. Com. M. A. ’62, M. Com. ’51)

From the given table,

Wages in Rs. below                 40    60    80    100    120
Number of persons in thousands    250   370   470    540    590

  x    $u_x$   $\Delta$   $\Delta^2$   $\Delta^3$   $\Delta^4$
 40    250
                120
 60    370               -20
                100                    -10
 80    470               -30                         20
                 70                     10
100    540               -20
                 50
120    590

Taking $x = \frac{70 - 40}{20} = 1.5$,

$$u_{1.5} = u_0 + 1.5 \Delta u_0 + \frac{(1.5)(1.5 - 1)}{2!} \Delta^2 u_0 + \frac{(1.5)(1.5 - 1)(1.5 - 2)}{3!} \Delta^3 u_0 + \frac{(1.5)(1.5 - 1)(1.5 - 2)(1.5 - 3)}{4!} \Delta^4 u_0,$$

where $u_0 = 250$, $\Delta u_0 = 120$, $\Delta^2 u_0 = -20$, $\Delta^3 u_0 = -10$, $\Delta^4 u_0 = 20$.

Substituting the above values, we have

$u_{1.5} = 423.6$ thousands,

i.e. the number of persons earning less than Rs. 70 is
423.6 thousands.

Hence the number of persons earning between Rs. 60-70

= (423.6 - 370) thousands
= 53.6 thousands.
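
The calculation above follows a fixed routine: form the leading differences, then accumulate the binomial coefficients of the fractional argument. A minimal sketch is given below; it is not part of the original text, the names `leading_differences` and `newton_forward` are illustrative, and the data are the cumulative wage figures of this example.

```python
from fractions import Fraction

def leading_differences(values):
    """Return [u_0, Delta u_0, Delta^2 u_0, ...] for equally spaced values."""
    diffs, row = [], [Fraction(v) for v in values]
    while row:
        diffs.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return diffs

def newton_forward(x0, h, values, x):
    """Newton's formula for equal intervals, evaluated at the argument x."""
    p = Fraction(x - x0) / h              # argument measured from the origin in units of h
    total, coeff = Fraction(0), Fraction(1)
    for r, d in enumerate(leading_differences(values)):
        total += coeff * d                # coeff equals p(p-1)...(p-r+1)/r!
        coeff *= (p - r) / (r + 1)
    return total

below = [250, 370, 470, 540, 590]         # persons (thousands) earning below Rs. 40, 60, ..., 120
below_70 = newton_forward(40, 20, below, 70)
print(float(below_70), float(below_70 - 370))
```

Exact fractions are used so that the intermediate products carry no rounding; the printed result is rounded only when reporting 423.6 and 53.6.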

6. If $y_0, y_1, \ldots, y_6$ be the consecutive terms of a series,
prove that

$$y_3 = 0.05 (y_0 + y_6) - 0.3 (y_1 + y_5) + 0.75 (y_2 + y_4).$$

(Agra B. Sc. ’61, Vikram ’61)

Assuming $y$ to be a polynomial of fifth degree, we have

$$\Delta^6 y_0 = 0$$

or $(E - 1)^6 y_0 = E^6 y_0 - {}^6C_1 E^5 y_0 + {}^6C_2 E^4 y_0 - {}^6C_3 E^3 y_0 + {}^6C_4 E^2 y_0 - {}^6C_5 E y_0 + y_0$

or $0 = y_6 - 6y_5 + 15y_4 - 20y_3 + 15y_2 - 6y_1 + y_0$

or $20 y_3 = (y_6 + y_0) - 6 (y_5 + y_1) + 15 (y_4 + y_2)$.

$$y_3 = 0.05 (y_6 + y_0) - 0.3 (y_5 + y_1) + 0.75 (y_4 + y_2).$$

5.11. Lagrange’s Formula. On the assumption that the
function $u_x$ can be expressed as a polynomial in $x$ of degree $n - 1$,
where $n$ values of $x$ and $u_x$ are given, an interpolation formula is
obtained as follows :

Let $u_x = A (x-b)(x-c) \ldots (x-k) + B (x-a)(x-c) \ldots (x-k) + \ldots + K (x-a)(x-b) \ldots (x-j)$,

where there are $n$ terms each of degree $n - 1$ in $x$.

To find $A, B, \ldots, K$, we give values to $x$. Putting $x = a$,
we have

$$u_a = A (a-b)(a-c) \ldots (a-k)$$

or $A = \frac{u_a}{(a-b)(a-c) \ldots (a-k)}$.

Similarly putting $x = b$, we get

$$B = \frac{u_b}{(b-a)(b-c) \ldots (b-k)}$$

and so on. Thus

$$u_x = u_a \frac{(x-b)(x-c) \ldots (x-k)}{(a-b)(a-c) \ldots (a-k)} + u_b \frac{(x-a)(x-c) \ldots (x-k)}{(b-a)(b-c) \ldots (b-k)} + \ldots$$

or

$$\frac{u_x}{(x-a)(x-b) \ldots (x-k)} = \frac{u_a}{(a-b)(a-c) \ldots (a-k)(x-a)} + \frac{u_b}{(b-a)(b-c) \ldots (b-k)(x-b)} + \ldots$$

The method is exactly the same as splitting

$$\frac{u_x}{(x-a)(x-b) \ldots (x-k)}$$

into partial fractions.

Lagrange’s formula is more general than Newton’s as it can
be applied to unequal intervals as well, but the working in this
case is very laborious.

5.12. Solved Examples.

1. The observed values of a function are respectively 168, 120,
72 and 63 at the four positions 3, 7, 9 and 10 of the independent
variable. What is the best estimate you can give for the value of
the function at the position 6 of the independent variable ?

(I. A. S. ’51, B. Sc. Agra ’59)

$x = 6$, $a = 3$, $b = 7$, $c = 9$, $d = 10$.

$u_a = 168$, $u_b = 120$, $u_c = 72$, $u_d = 63$.

Applying Lagrange’s formula,

$$\frac{u_6}{(6-3)(6-7)(6-9)(6-10)} = \frac{168}{(3-7)(3-9)(3-10)(6-3)} + \frac{120}{(7-3)(7-9)(7-10)(6-7)} + \frac{72}{(9-3)(9-7)(9-10)(6-9)} + \frac{63}{(10-3)(10-7)(10-9)(6-10)}$$

or

$$-\frac{u_6}{36} = -\frac{1}{3} - 5 + 2 - \frac{3}{4} = -\frac{49}{12},$$

giving

$u_6 = 147$.
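
Lagrange’s formula is laborious by hand but short to program. The sketch below is illustrative and not part of the original text; `lagrange` is a hypothetical function name, integer arguments are assumed so that exact fractions can be used, and the data are those of the example just solved.

```python
from fractions import Fraction

def lagrange(points, x):
    """Evaluate the Lagrange interpolating polynomial through (x_i, u_i) at x."""
    total = Fraction(0)
    for i, (xi, ui) in enumerate(points):
        term = Fraction(ui)
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= Fraction(x - xj, xi - xj)   # factor (x - x_j)/(x_i - x_j)
        total += term
    return total

observed = [(3, 168), (7, 120), (9, 72), (10, 63)]
u6 = lagrange(observed, 6)
print(u6)
```

The four terms of the sum are 12, 180, -72 and 27, which is the same arithmetic as the partial-fraction form used above.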


2. What is the value of the nth order difference of the rth
term of the series denoted by the variable $3^x$ ?

The function $3^x$ takes as it should the values 1, 3, 9 and 81
when $x$ equals 0, 1, 2, 4 respectively. Applying any method of
finite differences, obtain the value corresponding to $x = 3$. Explain
why the resulting value differs from $3^3$ or 27. (I. A. S. ’52)

$\Delta 3^x = 3^{x+1} - 3^x$, the interval of difference being unity,
$= 3^x (3 - 1) = 2 \cdot 3^x$.

Similarly, $\Delta^2 3^x = 2^2 \cdot 3^x$,

$\Delta^n 3^x = 2^n \cdot 3^x$,

$\Delta^n 3^r = 2^n \cdot 3^r$.

Here the given intervals are equal and hence we apply
Newton’s formula, assuming the function to be a polynomial of
third degree. Here $u_0 = 1$, $u_1 = 3$, $u_2 = 9$, $u_3 = ?$ and $u_4 = 81$.

$$\Delta^4 u_0 = 0$$

or $(E - 1)^4 u_0 = 0$,

giving $u_4 - 4u_3 + 6u_2 - 4u_1 + u_0 = 0$

or $81 - 4u_3 + 54 - 12 + 1 = 0$,

giving $u_3 = 31$.

The discrepancy between the actual value and the interpolated
value arises due to our assumption that $3^x$ is a polynomial of third
degree so that the fourth and higher order differences vanish,
while actually $\Delta^n 3^x = 2^n \cdot 3^x$ and differences of any order do not
vanish.

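3. If $u_x$ is a rational function of second degree, prove that

$$\left(\frac{d u_x}{dx}\right)_{x=0} = \frac{(b+c)\, u_a}{(a-b)(c-a)} + \frac{(c+a)\, u_b}{(a-b)(b-c)} + \frac{(a+b)\, u_c}{(b-c)(c-a)}.$$

Hence evaluate $\Sigma a^2 (b^2 - c^2)$. (B. A. Punjab ’54)

According to Lagrange’s formula,

$$u_x = \frac{u_a (x-b)(x-c)}{(a-b)(a-c)} + \frac{u_b (x-a)(x-c)}{(b-a)(b-c)} + \frac{u_c (x-a)(x-b)}{(c-a)(c-b)},$$

$$\frac{d u_x}{dx} = \frac{u_a}{(a-b)(a-c)} \{2x - (b+c)\} + \frac{u_b}{(b-a)(b-c)} \{2x - (c+a)\} + \frac{u_c}{(c-a)(c-b)} \{2x - (a+b)\},$$

$$\left(\frac{d u_x}{dx}\right)_{x=0} = -\frac{(b+c)\, u_a}{(a-b)(a-c)} - \frac{(c+a)\, u_b}{(b-a)(b-c)} - \frac{(a+b)\, u_c}{(c-a)(c-b)}$$
$$= \frac{(b+c)\, u_a}{(a-b)(c-a)} + \frac{(c+a)\, u_b}{(a-b)(b-c)} + \frac{(a+b)\, u_c}{(c-a)(b-c)}.$$

Putting $u_x = x^2$, for which $\frac{d u_x}{dx} = 2x$ vanishes at $x = 0$, we get

$$0 = \frac{(b+c)\, a^2}{(a-b)(c-a)} + \frac{(c+a)\, b^2}{(a-b)(b-c)} + \frac{(a+b)\, c^2}{(c-a)(b-c)}$$

or $0 = a^2 (b+c)(b-c) + b^2 (c+a)(c-a) + c^2 (a+b)(a-b)$

or $\Sigma a^2 (b^2 - c^2) = 0$.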

5.13. Central differences. We have seen how on certain
assumptions, Newton’s formula is applied to interpolate values of
a function between the given values. Sometimes it is more
advantageous to know the values such as $u_{-3}, u_{-2}, u_{-1}, u_0, u_1, u_2, u_3, \ldots$
in place of $u_0, u_1, u_2, \ldots$, and by Newton’s advancing difference
formula, any value of $u_x$ can be obtained in terms of a given value
of $u_x$ and its leading differences. There are various formulae
known as central difference formulae based on differences on either
side of the origin and they are more convenient as well as result
in more rapidly convergent series than Newton’s formula.

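5.14. Gauss’s formula. The pattern of difference table in
this case is essentially the same.

$u_{-2}$
          $\Delta u_{-2}$
$u_{-1}$             $\Delta^2 u_{-2}$
          $\Delta u_{-1}$             $\Delta^3 u_{-2}$
$u_0$                $\Delta^2 u_{-1}$             $\Delta^4 u_{-2}$
          $\Delta u_0$                $\Delta^3 u_{-1}$
$u_1$                $\Delta^2 u_0$
          $\Delta u_1$
$u_2$

Now $\Delta^2 u_0 = \Delta^2 u_{-1} + \Delta^3 u_{-1}$,

since $\Delta^2 u_{-1} + \Delta^3 u_{-1} = \Delta^2 (1 + \Delta) u_{-1} = \Delta^2 E u_{-1} = \Delta^2 u_0$;
similarly $\Delta^3 u_0 = \Delta^3 u_{-1} + \Delta^4 u_{-1}$, $\Delta^3 u_{-1} = \Delta^3 u_{-2} + \Delta^4 u_{-2}$ and so on.

$$u_x = u_0 + x \Delta u_0 + \frac{x (x-1)}{2!} \Delta^2 u_0 + \frac{x (x-1)(x-2)}{3!} \Delta^3 u_0 + \ldots$$
$$= u_0 + x \Delta u_0 + \frac{x (x-1)}{2!} (\Delta^2 u_{-1} + \Delta^3 u_{-1}) + \frac{x (x-1)(x-2)}{3!} (\Delta^3 u_{-1} + \Delta^4 u_{-1}) + \ldots$$
$$= u_0 + x \Delta u_0 + \frac{x (x-1)}{2!} \Delta^2 u_{-1} + \frac{(x+1)\, x (x-1)}{3!} \Delta^3 u_{-1} + \frac{(x+1)\, x (x-1)(x-2)}{4!} \Delta^4 u_{-2} + \ldots$$
$$= u_0 + x_{(1)} \Delta u_0 + x_{(2)} \Delta^2 u_{-1} + (x+1)_{(3)} \Delta^3 u_{-1} + (x+1)_{(4)} \Delta^4 u_{-2} + \ldots$$

where $x_{(r)}$ denotes $\frac{x (x-1) \ldots (x-r+1)}{r!}$, so that $x_{(1)} + x_{(2)} = (x+1)_{(2)}$.

This is known as Gauss’s ‘forward’ formula for equal intervals.

(B. Sc. Agra ’55)

If in the above formula we replace
$\Delta u_0$ by $\Delta u_{-1} + \Delta^2 u_{-1}$; $\Delta^3 u_{-1}$ by $\Delta^3 u_{-2} + \Delta^4 u_{-2}$ and so on, we get

$$u_x = u_0 + x_{(1)} \Delta u_{-1} + (x+1)_{(2)} \Delta^2 u_{-1} + (x+1)_{(3)} \Delta^3 u_{-2} + (x+2)_{(4)} \Delta^4 u_{-2} + \ldots$$

This is Gauss’s backward formula.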

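5.15. Stirling’s Formula. Taking the mean of the two
Gauss’s formulae, we have

$$u_x = u_0 + x \cdot \frac{1}{2} (\Delta u_0 + \Delta u_{-1}) + \frac{x^2}{2!} \Delta^2 u_{-1} + \frac{x (x^2 - 1^2)}{3!} \cdot \frac{1}{2} (\Delta^3 u_{-1} + \Delta^3 u_{-2}) + \frac{x^2 (x^2 - 1^2)}{4!} \Delta^4 u_{-2} + \ldots$$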

This is known as Stirling’s formula. In using it, $x$ should lie
in the range $-\frac{1}{2}$ to $+\frac{1}{2}$, i.e. $\frac{1}{2}$ on each side of $u_0$.

Example. Given the table :

x        310      320      330      340      350      360
log x    2.4914   2.5052   2.5185   2.5315   2.5441   2.5563

Find the value of log 3375 by a central difference formula.

If we take 330 as the origin and 10 as the unit of $x$, the value
required will be $u_{0.75}$ for log 337.5.

 x    $u_x$     $\Delta u_x$   $\Delta^2 u_x$   $\Delta^3 u_x$   $\Delta^4 u_x$
-2    2.4914
                0.0138
-1    2.5052                  -0.0005
                0.0133                           0.0002
 0    2.5185                  -0.0003                            -0.0003
                0.0130                          -0.0001
 1    2.5315                  -0.0004                             0.0001
                0.0126                           0.0000
 2    2.5441                  -0.0004
                0.0122
 3    2.5563

The Gauss forward formula is

$$u_x = u_0 + x_{(1)} \Delta u_0 + x_{(2)} \Delta^2 u_{-1} + (x+1)_{(3)} \Delta^3 u_{-1} + (x+1)_{(4)} \Delta^4 u_{-2}$$
$$u_{0.75} = 2.5185 + (0.75 \times 0.0130) + \frac{(0.75)(0.75 - 1)}{2!} (-0.0003)$$
$$+ \frac{(1.75)(1.75 - 1)(1.75 - 2)}{3!} (-0.0001)$$
$$+ \frac{(1.75)(1.75 - 1)(1.75 - 2)(1.75 - 3)}{4!} (-0.0003)$$
$$= 2.5283.$$

Hence $\log_{10} 3375 = 3.5283$.
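
The central-difference calculation can be mechanised as follows. This is a sketch under stated assumptions, not part of the original text: `gauss_forward` is a hypothetical function, the series is truncated at fourth differences as in the worked example, and the tabulated logarithms are entered as decimal strings so that exact fractions can be used.

```python
from fractions import Fraction

def gauss_forward(values, origin, p, max_order=4):
    """Gauss's forward formula.
    values: equally spaced entries; origin: index of u_0 in values;
    p: (x - x_0)/h; max_order: highest difference retained."""
    table = [[Fraction(v) for v in values]]
    while len(table[-1]) > 1:
        prev = table[-1]
        table.append([b - a for a, b in zip(prev, prev[1:])])
    p = Fraction(p)
    total, coeff = Fraction(0), Fraction(1)
    for r in range(min(max_order, len(table) - 1) + 1):
        idx = origin - r // 2             # the r-th difference is taken at u_{-floor(r/2)}
        if not 0 <= idx < len(table[r]):
            break
        total += coeff * table[r][idx]
        m = r + 1                         # next factor: p, p-1, p+1, p-2, p+2, ...
        factor = p + (m - 1) // 2 if m % 2 else p - m // 2
        coeff *= factor / m
    return total

logs = ["2.4914", "2.5052", "2.5185", "2.5315", "2.5441", "2.5563"]
log_337_5 = gauss_forward(logs, origin=2, p="0.75")
print(round(float(log_337_5), 4))
```

The differences are recomputed from the tabulated entries rather than copied from a table, which guards against slips of sign in the higher columns.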

5.16. Distinguish between interpolation and extrapolation. 
What is the importance of these methods in practical life ? 

When the values of a function for some values of the variable
are given, the process of finding the values of the function for some
values of the variable between the given values is known as
interpolation, while if the value of the function corresponding to some
value of the variable outside the given values of the variable is
required, the process is known as extrapolation. The assumptions
for extrapolation are the same as for interpolation, although
extrapolated values are not so reliable as interpolated values.
Extrapolation is used for forecasting, which is one of the major
tasks of statistics.

It is obvious that interpolation is a necessary part of our
practical life. Assuming that the conditions do not materially
change, we have to make calculations regarding the future in planning
for production, agriculture, population etc. The insurance
companies have to interpolate and extrapolate the average longevity
of life, the death rate at different ages and the average insurance
business they expect to get, and fix the premiums accordingly. In fact
in every walk of life, we do some sort of interpolation and
extrapolation.

The different methods of interpolation, namely graphic and
algebraic, are applicable for extrapolation
with the same assumptions.









EXERCISES 
With the usual notations, show that 

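(i) $u_0 + x \Delta u_1 + \frac{x (x-1)}{2!} \Delta^2 u_2 + \frac{x (x-1)(x-2)}{3!} \Delta^3 u_3 + \ldots = u_x + x \Delta^2 u_{x-1} + \frac{x (x-1)}{2!} \Delta^4 u_{x-2} + \ldots$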


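(ii) $u_x - \frac{1}{8} \Delta^2 u_{x-1} + \frac{1 \cdot 3}{8 \cdot 16} \Delta^4 u_{x-2} - \frac{1 \cdot 3 \cdot 5}{8 \cdot 16 \cdot 24} \Delta^6 u_{x-3} + \ldots$

$= u_{x+1/2} - \frac{1}{2} \Delta u_{x+1/2} + \frac{1}{4} \Delta^2 u_{x+1/2} - \frac{1}{8} \Delta^3 u_{x+1/2} + \ldots$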

Find $\Delta ab^{cx}$ and $\Delta^2 ab^{cx}$. [$(b^c - 1)\, ab^{cx}$; $(b^c - 1)^2 ab^{cx}$]

Find $\Delta^n (ax^n + bx^{n-1})$. [$a \cdot n!$]

The following table gives the census of population of an
Indian State in 1901, 1911, 1921 and 1931. Estimate the
population of the State in 1924, making your method clear.

Year    Population (in thousands)
1901    2,797
1911    2,935
1921    3,047
1931    3,354

(U. P. P. C. S. ’39) [3108.5 thousands nearly]
(Hint. Let the equation of the curve be $y = a + bx + cx^2 + dx^3$.)

If $l_x$ represents the number living at age $x$ in a life table, find
as accurately as possible $l_x$ for values of $x$ = 35, 42 and 47,
given $l_{20} = 512$; $l_{30} = 439$; $l_{40} = 346$; $l_{50} = 243$. (I. A. S. ’48)

[$l_{35} = 394$, $l_{42} = 326$, $l_{47} = 274$]


The table below gives the expectation of life at different ages. 
Find the expectation at age 49. 

Age 35 45 55 65 

Expectation 34 26 18 12 

(B. U. ’53) [22*5] 

The age of mothers and the average number of children born
per mother are given in the table below. Interpolate the average
number of children born per mother aged 30-34.

Age of mothers in years      15-19   20-24   25-29   30-34   35-39   40-44
Number of children born       0.7     2.1     3.5      ?      5.7     5.8

(P. C. S. ’43) [4.8]

Determine by Lagrange’s formula the percentage number of
criminals under 35 years.

Age               % number of criminals
under 25 years    52.0
under 30 years    67.3
under 40 years    84.1
under 50 years    94.4

(M. A. Agra ’34) [77.43]

From the following table determine $e^{0.1245}$ and $e^{0.1895}$ :

x        0.12       0.13       0.14       0.15
$e^x$    1.127497   1.138828   1.150274   1.161834

x        0.16       0.17       0.18       0.19
$e^x$    1.173511   1.185305   1.197217   1.209250

[1.132582, 1.208645]


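Prove that

(a) $u_4 = u_3 + \Delta u_2 + \Delta^2 u_1 + \Delta^3 u_1$.

(b) $\Delta x^m - \frac{1}{2} \Delta^2 x^m + \frac{1 \cdot 3}{2 \cdot 4} \Delta^3 x^m - \ldots$ to $m$ terms $= \left(x + \frac{1}{2}\right)^m - \left(x - \frac{1}{2}\right)^m$.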



CHAPTER VI 
PROBABILITY 


6.1. Introduction. In our daily life we often come across
sentences like :

(1) It is quite probable that he may go out today. 

(2) Most probably I shall be returning within a week. 

(3) It is impossible that he may refuse to do my work. 

In the above sentences, the word ‘probable’ refers to chance
or likelihood of the happening of some event. Thus in the third
sentence the word ‘impossible’ suggests that there is absolutely no
likelihood of the person refusing the work or, mathematically, we
may say that his probability of refusing to do the work is zero, or,
in other words, the probability of his doing the work is unity.

6.2. Definition. The Classical or Mathematical or a priori
definition of probability is as follows :

If there are $s$ exhaustive, mutually exclusive and equally
likely outcomes of a trial and $r$ of them are favourable to the
happening of an event A, then the probability of the happening
of A is

$$P(A) = \frac{r}{s}.$$

It is sometimes expressed by saying that the odds in favour of A are
$r$ to $s - r$ or the odds against A are $s - r$ to $r$.

Note. By exhaustive we mean that one of the events must
happen, i.e. they include all possible outcomes. Thus when
we toss a coin, either it must fall heads or tails (the possibility of
standing on the edge is ruled out). Similarly the throw of a
six-faced die must result in one of the six faces falling
uppermost. Mutually exclusive means that the happening of an event
excludes the possibility of the happening of the other events in
the same trial; no two or more events can happen simultaneously.
Thus when a coin is tossed, the falling of heads in a toss excludes
the possibility of falling tails in the same toss. Cases are said
to be equally likely when the chances of the happening of an
event are not greater or less than those of any other. Thus if
we throw a perfectly symmetrical die, there is nothing to suggest
that the coming uppermost of one face is more or less likely
than any of the others and hence all faces are equally likely
to come up. It may be noted that the probability is a quantity
which lies between 0 and 1.

It can thus be seen that the probability of falling heads in
the toss of a symmetrical coin is $\frac{1}{2}$ and so for the tails. The
probability of drawing a black card out of a pack of cards is $\frac{1}{2}$,
since there are equal numbers of cards of black and red colours in
a pack; the probability of drawing a card of hearts is $\frac{13}{52}$ or $\frac{1}{4}$, as
there are thirteen cards of hearts, and so on. If there are $a$
black balls and $b$ white balls in an urn and one ball is drawn at
random, the probability of a black ball coming out is $\frac{a}{a+b}$ and that
of a white ball is $\frac{b}{a+b}$.

The above definition of probability deals with cases, where 
all the events are equally likely. Thus in the case of a biased 
coin or die this definition fails; moreover in the cases, where the 
number of possible events is infinite as in the case of a continuous 
variable, this definition does not offer a plausible explanation. 

6.3. Statistical or Empirical Definition. If a large number
of trials be performed under the same conditions, and if the
limit of the ratio of the number of happenings of an event to the total
number of trials is unique and finite, then this limit is known as the
probability of the happening of that event.

If an event A occurs $r$ times in a series of $n$ independent
random trials and if there is a sequence of such sets of trials, the event
A happening $r_i$ times in the $i$th set of $n_i$ trials, the probability of the
event A is

$$P(A) = \lim_{n_i \to \infty} \frac{r_i}{n_i},$$

provided the limit is unique and finite.

The above definition involves the idea of what is known as
relative frequency. If $m$ is the number of successes obtained in
$n$ independent trials, then the quotient $\frac{m}{n}$ is known as the relative
frequency of successes. The assumption in the statistical
definition is that this relative frequency in a series of trials
performed under the same conditions tends to a definite limit as
the number of trials tends to infinity and this limit is taken as the
probability of the occurrence of the event.
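
The tendency of the relative frequency to settle down can be illustrated by simulation. The sketch below is not part of the original text and assumes an idealised fair coin produced by Python's pseudo-random generator; the function name `relative_frequency` and the seed are arbitrary choices made only so that a run can be repeated.

```python
import random

def relative_frequency(trials, seed=0):
    """Relative frequency m/n of heads in `trials` simulated tosses of a fair coin."""
    rng = random.Random(seed)                 # fixed seed: the run is repeatable
    successes = sum(rng.random() < 0.5 for _ in range(trials))
    return successes / trials

for n in (10, 100, 1000, 100000):
    print(n, relative_frequency(n))
```

For small n the ratio may lie far from 1/2; as n grows it is expected to stay close to 1/2, which is the behaviour the statistical definition postulates.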

6.4. Independent, Compound Events.

(i) An event $A_2$ is said to be independent of $A_1$ if the
occurrence of $A_1$ or $\bar{A}_1$ (non-occurrence of $A_1$) does not alter the
probability of $A_2$, e.g. the outcome of the toss of a coin is
independent of the previous toss or tosses.

(ii) Events are said to be compound (or decomposable) when
they can be decomposed into a number of simple events. Thus
the throw of a sum of 8 with two dice can be decomposed into

(2, 6), (3, 5), (4, 4), (5, 3), (6, 2).
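
The decomposition can be reproduced by listing all equally likely throws of two dice. The following is a minimal sketch, not part of the original text; the variable names are illustrative, and fair six-faced dice are assumed.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))          # the 36 ordered throws
favourable = [pair for pair in outcomes if sum(pair) == 8]
probability = Fraction(len(favourable), len(outcomes))
print(favourable, probability)
```

Counting the simple events in this way gives the classical probability of the compound event directly.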

6.5. State and prove the addition and multiplication theorems
of probabilities. (Agra M. Sc. ’52, ’53)

Addition Theorem of Probabilities. The probability that one of
the several mutually exclusive events shall happen is the sum of the
probabilities of the separate events.

Let $m_1$ be the cases favourable to the happening of an
event $A_1$ out of the $n$ mutually exclusive and equally probable
events, $m_2$ to the happening of another event $A_2$ and so on,
$m_i$ favourable to $A_i$. Since $A_1, A_2, \ldots, A_i, \ldots, A_k$ are mutually
exclusive, the cases favourable to the happening of one of
$A_1, A_2, \ldots, A_i, \ldots, A_k$ are $m_1 + m_2 + \ldots + m_i + \ldots + m_k$. Hence

$$P(A_1 + A_2 + \ldots + A_k) = \frac{m_1 + m_2 + \ldots + m_k}{n}$$
$$= \frac{m_1}{n} + \frac{m_2}{n} + \ldots + \frac{m_k}{n}$$
$$= p_1 + p_2 + \ldots + p_k$$
$$= \sum_{i=1}^{k} p_i = \sum_{i=1}^{k} P(A_i),$$

where $P(A_1 + A_2 + \ldots + A_k)$ denotes the probability of happening
of one of the events $A_1, A_2, \ldots, A_k$.

Multiplication Theorem of Probabilities. The probability of the combined happening of two events $A_1$ and $A_2$ is the product of the probability of $A_1$ and the conditional probability of $A_2$ on the assumption that $A_1$ has happened, or vice versa:

$$P(A_1 A_2) = P(A_1)\,P(A_2 \mid A_1), \qquad P(A_1) \neq 0,$$

where $P(A_2 \mid A_1)$ denotes the probability of $A_2$ on the assumption that $A_1$ has already happened and $P(A_1 A_2)$ is the probability of the combined happening of $A_1$ and $A_2$ in the same trial.

Let $m_1$ denote the cases favourable to the happening of $A_1$ out of $n$ mutually exclusive, exhaustive and equally likely cases, and out of these $m_1$ cases let $m_2$ be favourable to the happening of $A_2$. Then

$$P(A_1 A_2) = \frac{m_2}{n} = \frac{m_1}{n} \cdot \frac{m_2}{m_1} = P(A_1)\,P(A_2 \mid A_1),$$

since $P(A_2 \mid A_1) = \dfrac{m_2}{m_1}$, as out of $m_1$ equally likely cases favourable to $A_1$ there are $m_2$ cases favourable to the happening of $A_2$.

Also $P(A_1 A_2) = P(A_2)\,P(A_1 \mid A_2)$.

The generalised form of the above theorem is

$$P(A_1 A_2 \ldots A_k) = P(A_1)\,P(A_2 \mid A_1) \ldots P(A_k \mid A_1 A_2 \ldots A_{k-1}).$$

If $A_1$ and $A_2$ are independent,

$$P(A_2 \mid A_1) = P(A_2) \text{ etc.}$$

In this case we get $P(A_1 A_2 \ldots A_k) = P(A_1)\,P(A_2) \ldots P(A_k)$.

Some Important Results.

(i) $P(A) + P(\bar{A}) = 1$.

Since $A$ and $\bar{A}$ are mutually exclusive events,

$$P(A) + P(\bar{A}) = P(A + \bar{A}) \text{ (by the addition theorem)} = 1,$$

since $A$ must either happen or not happen.

(ii) $P(B) = P(AB) + P(\bar{A}B)$,

since $B$ can happen in two ways: either $A$ happens or $A$ does not happen while $B$ happens.

6.6. What is meant by the probability of an event? Prove the formula for the probability of occurrence of at least one of the two given events and generalise to prove the following theorem:

$$P(A_1 + A_2 + \ldots + A_n) = S_1 - S_2 + S_3 - \ldots + (-1)^{n-1} S_n,$$

where $S_i$ stands for the sum of the probabilities of simultaneous occurrence of exactly $i$ of the $n$ events, the summation extending over all possible combinations. (M. Sc. Agra ’62)

If $A_1, A_2, \ldots, A_n$ are not exclusive and $P(A_1 + A_2)$ denotes the probability of happening of at least one of the two events $A_1$ and $A_2$,

$$P(A_1 + A_2) = P(A_1\bar{A}_2 + A_1 A_2 + A_2\bar{A}_1) = P(A_1\bar{A}_2) + P(A_1 A_2) + P(A_2\bar{A}_1). \qquad \ldots\text{(i)}$$

Also

$$P(A_1) = P(A_1 A_2) + P(A_1\bar{A}_2), \qquad \ldots\text{(ii)}$$

$$P(A_2) = P(A_1 A_2) + P(A_2\bar{A}_1). \qquad \ldots\text{(iii)}$$

Substituting the values of $P(A_1\bar{A}_2)$ and $P(A_2\bar{A}_1)$ in (i) from (ii) and (iii),

$$P(A_1 + A_2) = P(A_1) + P(A_2) - P(A_1 A_2).$$

Similarly,

$$P(A_1 + A_2 + A_3) = P\{(A_1 + A_2) + A_3\} = P(A_1 + A_2) + P(A_3) - P\{(A_1 + A_2) A_3\}$$

$$= P(A_1) + P(A_2) + P(A_3) - P(A_1 A_2) - P(A_1 A_3) - P(A_2 A_3) + P(A_1 A_2 A_3)$$

and so on. Generalising, we get

$$P(A_1 + A_2 + \ldots + A_n) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i A_j) + \sum_{i<j<k} P(A_i A_j A_k) - \ldots$$

$$= S_1 - S_2 + S_3 - \ldots + (-1)^{n-1} S_n.$$
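
The expansion $S_1 - S_2 + S_3 - \ldots$ can be checked numerically on a small sample space. The following is a minimal sketch in Python, assuming equally likely outcomes; the function name and the three dice events are illustrative choices and do not come from the text.

```python
from fractions import Fraction
from itertools import combinations, product


def union_by_inclusion_exclusion(events, sample_space):
    """P(A_1 + ... + A_n) computed as S_1 - S_2 + S_3 - ... for equally likely outcomes."""
    n_points = len(sample_space)
    total = Fraction(0)
    for i in range(1, len(events) + 1):
        # S_i: sum of the probabilities of joint occurrence over all combinations of i events
        s_i = sum(
            Fraction(len(set.intersection(*combo)), n_points)
            for combo in combinations(events, i)
        )
        total += (-1) ** (i - 1) * s_i
    return total


# Illustrative (assumed) events on a throw of two dice.
space = list(product(range(1, 7), repeat=2))
a1 = {s for s in space if s[0] == 6}    # first die shows six
a2 = {s for s in space if s[1] == 6}    # second die shows six
a3 = {s for s in space if sum(s) == 8}  # total of eight

direct = Fraction(len(a1 | a2 | a3), len(space))  # count the union directly
via_formula = union_by_inclusion_exclusion([a1, a2, a3], space)
```

Both `direct` and `via_formula` are exact fractions, so they can be compared for equality without rounding.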

6.7. Binomial and Multinomial Theorems.

Binomial Theorem. If the probability of success in one trial is $p$ and that of failure is $q$ so that $p + q = 1$, then the probability of $r$ successes in $n$ trials is given by

$${}^{n}C_{r}\, p^{r} q^{n-r}$$

or the $(r+1)$th term in the expansion of $(q+p)^n$.

(Punjab M. A. ’56, Gujrat ’58)

There are in all $n$ trials, out of which $r$ can be taken in ${}^{n}C_{r}$ ways. Again the chance that the event happens in $r$ specified trials and fails in the remaining $n-r$ trials $= p^{r} q^{n-r}$.

Thus the chance of exactly $r$ successes $= {}^{n}C_{r}\, p^{r} q^{n-r}$.

Putting $r = 1, 2, 3, \ldots$ we get the probability of the event happening exactly once, twice, thrice, ... in $n$ trials. Thus the probability that the event will happen exactly $r$ times in $n$ trials is the $(r+1)$th term in the expansion of $(q+p)^n$.
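
The binomial formula is easily evaluated by machine. The sketch below is one possible Python rendering, using exact fractions; the function names are illustrative, and the numerical inputs are those of Solved Examples 5 and 9 later in this chapter.

```python
from fractions import Fraction
from math import comb


def binomial_probability(n, r, p):
    """Chance of exactly r successes in n independent trials: nCr * p^r * q^(n-r)."""
    q = 1 - p
    return comb(n, r) * p**r * q ** (n - r)


def at_least(n, r_min, p):
    """Chance of r_min or more successes in n trials."""
    return sum(binomial_probability(n, r, p) for r in range(r_min, n + 1))


# The terms of (q + p)^6 with p = 2/3; together they must add up to 1.
terms = [binomial_probability(6, r, Fraction(2, 3)) for r in range(7)]

four_or_more_of_six = at_least(6, 4, Fraction(2, 3))      # Solved Example 5
four_or_more_safe = at_least(5, 4, Fraction(9, 10))       # Solved Example 9
```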

Multinomial Theorem. If a die has $f$ faces marked with $1, 2, \ldots, f$, the probability of throwing a total $p$ with $n$ dice is given by the coefficient of $x^p$ in the expansion of $(x + x^2 + \ldots + x^f)^n$ divided by $f^n$.

The number of ways in which $n$ dice can fall so that the marks on them sum to $p$ is the coefficient of $x^p$ in the expansion of $(x^1 + x^2 + \ldots + x^f)^n$, since this is the total number of ways in which the numbers $1, 2, 3, \ldots, f$, taken $n$ in all, can add to $p$.

Also the total number of ways in which $n$ dice with $f$ faces each can fall is $f^n$, since each die can fall in $f$ ways.

Hence the required probability

$$= \frac{\text{coefficient of } x^p \text{ in the expansion of } (x^1 + x^2 + \ldots + x^f)^n}{f^n}$$

$$= \frac{1}{f^n} \sum \frac{n!}{a_1!\, a_2! \ldots a_f!},$$

where $a_1, a_2, \ldots, a_f$ are $f$ non-negative whole numbers ($a_i$ being the number of dice showing the face $i$) such that $a_1 + a_2 + \ldots + a_f = n$ and $a_1 + 2a_2 + \ldots + f a_f = p$, the summation extending over all such sets.
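
The coefficient of $x^p$ can be obtained by multiplying out the polynomial one die at a time. The following Python sketch does this with a plain list of coefficients; the function names are hypothetical and the approach is only one of several that would serve.

```python
from fractions import Fraction


def total_distribution(n, f):
    """Coefficients of (x + x^2 + ... + x^f)^n, indexed by the exponent of x."""
    coeffs = [1]  # start from the polynomial 1
    for _ in range(n):
        new = [0] * (len(coeffs) + f)
        for exponent, c in enumerate(coeffs):
            for face in range(1, f + 1):
                new[exponent + face] += c  # multiply the term by x^face
        coeffs = new
    return coeffs


def probability_of_total(p, n, f):
    """Chance of throwing a total p with n dice of f faces each."""
    coeffs = total_distribution(n, f)
    ways = coeffs[p] if 0 <= p < len(coeffs) else 0
    return Fraction(ways, f**n)


# A sum of 8 with two ordinary dice arises in 5 ways out of 36.
eight_with_two_dice = probability_of_total(8, 2, 6)
```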


6.8. Given $n$ independent events with respective probabilities of occurrence $p_1, p_2, \ldots, p_n$, write down the probability of at least one of the events happening. (I. A. S. ’51, M. Sc. Agra ’57)

Let us consider the probability of none of the events happening. The probability of the first event not happening is $(1 - p_1)$, of the second not happening is $(1 - p_2)$ and so on. Since they are independent, the probability of none of the events happening is

$$(1 - p_1)(1 - p_2) \ldots (1 - p_n).$$

Hence the probability of at least one of the events happening can be obtained by excluding this case, i.e. none of the events happening.

The required probability is

$$1 - (1 - p_1)(1 - p_2) \ldots (1 - p_n)$$

$$= \sum_{i} p_i - \sum_{i<j} p_i p_j + \sum_{i<j<k} p_i p_j p_k - \ldots + (-1)^{n-1} p_1 p_2 \ldots p_n.$$
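
The two forms of the result, the product form and its expansion, can be compared directly. The Python sketch below does so for three assumed probabilities 1/2, 1/3 and 1/4, chosen only for illustration; the function names are likewise illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import prod


def at_least_one(ps):
    """1 - (1 - p_1)(1 - p_2)...(1 - p_n) for independent events."""
    return 1 - prod(1 - p for p in ps)


def at_least_one_expanded(ps):
    """The same probability as sum p_i - sum p_i p_j + sum p_i p_j p_k - ..."""
    total = Fraction(0)
    for i in range(1, len(ps) + 1):
        total += (-1) ** (i - 1) * sum(prod(c) for c in combinations(ps, i))
    return total


# Assumed probabilities, for illustration only.
ps = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)]
closed_form = at_least_one(ps)
expanded = at_least_one_expanded(ps)
```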


6.9. Solved Examples.

1. The chance of an event happening is the square of the chance of a second event but the odds against the first are the cube of the odds against the second. Find the chance of each.

Let the chance of the first event happening be $p$ and that of the second be $p'$. Then

$$p = p'^2$$

and

$$\frac{1 - p}{p} = \left(\frac{1 - p'}{p'}\right)^3$$

or

$$\frac{1 - p'^2}{p'^2} = \frac{(1 - p')^3}{p'^3},$$

i.e. $p'(1 + p') = (1 - p')^2$, giving

$$p' = \tfrac{1}{3} \text{ and } p = \tfrac{1}{9}.$$


2. There are four letters and four addressed envelopes. Find the chance that all letters are not despatched in the right envelopes.

The total number of ways in which 4 letters can be put in 4 envelopes $= 4!$.

Also all the letters can be put in the right envelopes in one way only. Hence the probability of all letters being put in the right envelopes $= \dfrac{1}{4!} = \dfrac{1}{24}$.

The required probability $= 1 - \dfrac{1}{24} = \dfrac{23}{24}$.

3. In a hand at whist what is the chance that four queens are held by a specified player?

The particular player gets thirteen cards, of which four are to be the queens in this case and 9 other cards out of the remaining 48 cards.

The four queens can be selected in ${}^{4}C_{4}$ ways and the remaining 9 cards in ${}^{48}C_{9}$ ways.

Also the total number of ways in which the specified player can get 13 cards $= {}^{52}C_{13}$.

Hence the required probability

$$= \frac{{}^{4}C_{4} \times {}^{48}C_{9}}{{}^{52}C_{13}} = \frac{{}^{48}C_{9}}{{}^{52}C_{13}}.$$

4. A problem in statistics is given to three students whose chances of solving it are $\tfrac{1}{2}$, $\tfrac{1}{3}$ and $\tfrac{1}{4}$. What is the probability that the problem will be solved? (Agra M. Sc. ’57)

The probability that none of the students shall be able to solve the problem is

$$(1 - \tfrac{1}{2}) \times (1 - \tfrac{1}{3}) \times (1 - \tfrac{1}{4}).$$

Hence the probability that the problem will be solved is

$$1 - (1 - \tfrac{1}{2})(1 - \tfrac{1}{3})(1 - \tfrac{1}{4}) = \tfrac{3}{4}.$$

5. An experiment succeeds twice as often as it fails. Find the chance that in the next six trials, there will be at least four successes. (Agra B. Sc. ’55)

If $p$ is the probability of success and $q$ of failure, $p = \tfrac{2}{3}$ and $q = \tfrac{1}{3}$, since $p = 2q$ and $p + q = 1$.

The probabilities of 1 success, 2 successes and so on are given by the terms of the expansion of $(p + q)^n$ or in this case

$$(\tfrac{2}{3} + \tfrac{1}{3})^6.$$

We add the probabilities of 6, 5 and 4 successes:

$$(\tfrac{2}{3})^6 + {}^{6}C_{1} (\tfrac{2}{3})^5 (\tfrac{1}{3}) + {}^{6}C_{2} (\tfrac{2}{3})^4 (\tfrac{1}{3})^2 = \frac{64 + 192 + 240}{729} = \frac{496}{729}.$$

6. If 4 whole numbers taken at random are multiplied together, show that the chance that the last digit in the product is 1, 3, 7 or 9 is $\dfrac{16}{625}$. (Agra M. Sc. ’48, ’61)

In any number the last digit can be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. Hence the probability that it is divisible by 2 or 5 is $\tfrac{6}{10} = \tfrac{3}{5}$, since in that case the last digit can be one of 0, 2, 4, 5, 6 or 8. The probability that the number is not divisible by 2 or 5 is $\tfrac{2}{5}$. In order that the product is not divisible by 2 or 5, none of the constituent numbers should be divisible by 2 or 5 and the probability of this is $(\tfrac{2}{5})^4 = \tfrac{16}{625}$. This is the probability that in the product the last digit is 1, 3, 7 or 9.

7. What is the chance that a leap year selected at random will contain 53 Sundays? (Agra M. Sc. ’55)

A leap year consists of 366 days and shall have 52 complete weeks and two extra days. These two days can be (i) Monday and Tuesday, (ii) Tuesday and Wednesday, (iii) Wednesday and Thursday, (iv) Thursday and Friday, (v) Friday and Saturday, (vi) Saturday and Sunday, (vii) Sunday and Monday.

Of these seven cases, the last two are favourable and hence the required probability is $\tfrac{2}{7}$.

8. A and B stand in a ring with ten other persons. If the arrangement of twelve persons is at random, find the chance that there are exactly three persons between A and B. (Agra M. Sc. ’51)

Three persons can be selected out of 10 in ${}^{10}C_{3}$ ways. If we fix the positions of A and B, the three persons between them can be arranged in ${}^{3}P_{3}$ or $3!$ ways. Similarly the seven other persons can be arranged in $7!$ ways. Also A and B can be arranged among themselves in the two positions in $2!$ ways.

The total number of ways in which twelve persons can be arranged in a ring is $11!$, since we can fix the position of one man and allot positions to the others.

Required probability

$$= \frac{3! \times 7! \times 2! \times {}^{10}C_{3}}{11!} = \frac{2}{11}.$$

9. If on average 1 vessel in every ten is wrecked, find the chance that out of 5 vessels, 4 at least will arrive safely.

(Agra M. Sc. ’59, Lucknow B. Sc. ’48)

Let the probability of wrecking be $q$ and of safe arrival be $p$, so that $q = \tfrac{1}{10}$ and $p = \tfrac{9}{10}$. The various probabilities shall be given by the terms in the expansion of $(q + p)^5$.

Also in this case we have to find the probability of survival of 4 and 5 vessels.

The required probability is

$${}^{5}C_{4} \left(\tfrac{9}{10}\right)^4 \left(\tfrac{1}{10}\right) + {}^{5}C_{5} \left(\tfrac{9}{10}\right)^5 = \frac{32805 + 59049}{100000} = \frac{45927}{50000}.$$

10. A and B are two independent witnesses (i.e. there is no collusion between them) in a case. The probability that A will speak the truth is $x$ and the probability that B will speak the truth is $y$. A and B agree in a certain statement. Show that the probability that this statement is true is

$$\frac{xy}{1 - x - y + 2xy}.$$

(Lucknow B. Sc. ’49)

A and B both agree when either both of them are speaking the truth or both are making false statements. Hence the ratio of cases of agreeing is $xy + (1 - x)(1 - y)$.

The ratio of cases of their both speaking the truth is $xy$.

Hence the required probability

$$= \frac{xy}{xy + (1 - x)(1 - y)} = \frac{xy}{1 - x - y + 2xy}.$$

11. A lot of 100 pens contains 10 defective pens. 5 pens are selected at random from the lot and sent to the retail store. What is the probability that the store will receive at least one defective pen?

Let us consider the case when the store does not receive any defective pen. 5 non-defective pens can be selected out of 90 in ${}^{90}C_{5}$ ways. Also the total number of ways in which 5 pens out of 100 can be selected is ${}^{100}C_{5}$.

The required probability

$$= 1 - \frac{{}^{90}C_{5}}{{}^{100}C_{5}} = .4162 \text{ nearly.}$$
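
The arithmetic of Example 11 may be confirmed by a short computation. The Python sketch below is an illustration only; the function name and its parameter names are assumptions made here, not part of the text.

```python
from fractions import Fraction
from math import comb


def prob_at_least_one_defective(lot, defective, sample):
    """1 minus the chance that every selected pen is non-defective."""
    none_defective = Fraction(comb(lot - defective, sample), comb(lot, sample))
    return 1 - none_defective


p = prob_at_least_one_defective(100, 10, 5)
p_rounded = round(float(p), 4)  # decimal form, to four places
```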




12. Three groups of children contain 3 girls and 1 boy, 2 girls and 2 boys, 1 girl and 3 boys. One child is selected at random from each group. Show that the chance that the three selected consist of 1 girl and 2 boys is $\tfrac{13}{32}$.

(Agra M. Sc. ’55, ’59, B. Sc. ’53)

The selection can be made in the following manner:

(i) Boy, boy, girl: probability $= \tfrac{1}{4} \times \tfrac{2}{4} \times \tfrac{1}{4}$

(ii) Boy, girl, boy: probability $= \tfrac{1}{4} \times \tfrac{2}{4} \times \tfrac{3}{4}$

(iii) Girl, boy, boy: probability $= \tfrac{3}{4} \times \tfrac{2}{4} \times \tfrac{3}{4}$.

Since these are mutually exclusive events, the required probability is the sum of the probabilities of the three cases.

The required probability $= \tfrac{2 + 6 + 18}{64} = \tfrac{13}{32}$.

13. The odds that a book will be reviewed favourably by three independent critics are 5 to 2, 4 to 3 and 3 to 4. What is the probability that of the three reviews, a majority will be favourable? (Agra M. Sc. ’54, B. Sc. ’55)

The probability that it shall be reviewed favourably by the first critic is $\tfrac{5}{7}$, by the second $\tfrac{4}{7}$ and by the third $\tfrac{3}{7}$. For two or three of the reviews to be favourable, the probabilities are as follows:

Favourable, favourable, unfavourable: probability $= \tfrac{5}{7} \times \tfrac{4}{7} \times (1 - \tfrac{3}{7})$

Favourable, unfavourable, favourable: probability $= \tfrac{5}{7} \times (1 - \tfrac{4}{7}) \times \tfrac{3}{7}$

Unfavourable, favourable, favourable: probability $= (1 - \tfrac{5}{7}) \times \tfrac{4}{7} \times \tfrac{3}{7}$

Favourable, favourable, favourable: probability $= \tfrac{5}{7} \times \tfrac{4}{7} \times \tfrac{3}{7}$.

Since they are mutually exclusive events, the law of additive probabilities applies.

Required probability

$$= \tfrac{1}{343} \{(5 \times 4 \times 4) + (5 \times 3 \times 3) + (2 \times 4 \times 3) + (5 \times 4 \times 3)\} = \frac{209}{343}.$$

14. Four persons are chosen at random from a group containing 3 men, 2 women and 4 children. Show that the chance that exactly two of them will be children is $\tfrac{10}{21}$. (Agra M. Sc. ’63)

Two of the selected persons are children; then the remaining two can be selected from 5 persons, i.e. 3 men and 2 women.

The children can be selected in ${}^{4}C_{2}$ ways and the rest of the selections can be made in ${}^{5}C_{2}$ ways. Also four persons can be chosen out of nine in ${}^{9}C_{4}$ ways.

Hence the required probability

$$= \frac{{}^{4}C_{2} \times {}^{5}C_{2}}{{}^{9}C_{4}} = \frac{\dfrac{4 \times 3}{2} \times \dfrac{5 \times 4}{2}}{\dfrac{9 \times 8 \times 7 \times 6}{1.2.3.4}} = \frac{10}{21}.$$

15. The probability that a worker drawn at random from a certain factory is male is .651, that a worker is married is .701, and that a worker is a married male is .472. Find the probability that a worker drawn at random is

(a) a married female,

(b) a single female,

(c) a male or married or both.

Let us denote the males by $A$, so that the females are denoted by $\bar{A}$, the married by $B$ and the unmarried by $\bar{B}$.

Given $P(A) = .651$, $P(B) = .701$, $P(AB) = .472$.

(a) $P(\bar{A}B) = P(B) - P(AB) = .701 - .472 = .229$.

(b) $P(\bar{A}\bar{B}) = P(\bar{A}) - P(\bar{A}B) = (1 - .651) - .229 = .120$.

Now in (c), a married person includes married males, so that the problem reduces to finding the probability that the worker selected is either a married one or a single male, and these are mutually exclusive.

(c) $P(B) + P(A\bar{B}) = P(B) + P(A) - P(AB) = .701 + .651 - .472 = .880$.

16. A is one of the six horses entered for a race, and is to be ridden by one of the two jockeys B and C; it is 2 to 1 that B rides A, in which case all horses are equally likely to win; if C rides A his chance is trebled. What are the odds against his winning?

The probability of B riding A is $\tfrac{2}{3}$ and in this case all six horses are equally likely to win. Hence the probability that B rides A and A wins $= \tfrac{2}{3} \times \tfrac{1}{6} = \tfrac{1}{9}$.

The probability of C riding A is $1 - \tfrac{2}{3}$ or $\tfrac{1}{3}$ and in this case the chance of A’s win is three times the chance in the previous case, i.e. $\tfrac{1}{6} \times 3$ or $\tfrac{1}{2}$.

Hence the probability that C rides A and A wins

$$= \tfrac{1}{3} \times \tfrac{1}{2} = \tfrac{1}{6}.$$

Since these are mutually exclusive events, the total probability of A’s winning the race $= \tfrac{1}{9} + \tfrac{1}{6} = \tfrac{5}{18}$.

Hence the odds against his winning are 13 to 5.

17. Goddard, the captain of the West Indies Cricket team, is reported to have observed the rule of calling ‘heads’ every time the toss was made during the five matches of the last test series with the Indian team. What is the probability of his winning the toss in all the five matches? (I. A. S. ’50)

How will the probability be affected if he had made a rule of tossing a coin privately to decide whether to call ‘heads’ or ‘tails’ on each occasion?

The probability of ‘head’ falling in one toss of a coin is $\tfrac{1}{2}$ and the probability of ‘heads’ falling all the five times $= (\tfrac{1}{2})^5 = \tfrac{1}{32}$.

The tossing of a coin privately does not affect the probability in the test match toss in any way, since the tosses of a coin are independent events.

18. A card is drawn from a pack, the card is replaced and the pack shuffled. If this is done six times, what is the chance that the cards drawn are 2 hearts, 2 diamonds and 2 black cards?

If we denote the probabilities of drawing a card of hearts by $x$, of diamonds by $y$ and of black colour by $z$, the probabilities of the different draws will be given by the terms of the multinomial expansion $(x + y + z)^6$.

Hence the probability of drawing 2 cards of hearts, 2 of diamonds and 2 of black colour is the term containing $x^2 y^2 z^2$.

The required probability $= \dfrac{6!}{2!\,2!\,2!}\, x^2 y^2 z^2$.

Put $x = \tfrac{1}{4}$, $y = \tfrac{1}{4}$ and $z = \tfrac{1}{2}$.

The required probability

$$= \frac{6!}{2!\,2!\,2!} \left(\tfrac{1}{4}\right)^2 \left(\tfrac{1}{4}\right)^2 \left(\tfrac{1}{2}\right)^2 = \frac{45}{512}.$$

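19. What is the chance that a hand of five cards contains at least two aces?

The hand can contain two, three or four aces. The number of hands containing two aces and three other cards is ${}^{4}C_{2} \times {}^{48}C_{3}$; similarly the number containing three aces is ${}^{4}C_{3} \times {}^{48}C_{2}$ and the number containing four aces is ${}^{4}C_{4} \times {}^{48}C_{1}$. The total number of hands of five cards is ${}^{52}C_{5}$.

The required probability

$$= \frac{({}^{4}C_{2} \times {}^{48}C_{3}) + ({}^{4}C_{3} \times {}^{48}C_{2}) + ({}^{4}C_{4} \times {}^{48}C_{1})}{{}^{52}C_{5}}.$$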




20. Out of 20 consecutive numbers two are chosen at random; find the probability that their sum is odd.

The sum will be odd if one of the numbers is odd and the other is even; also the even and odd numbers are ten each.

Having chosen the first number arbitrarily, the second number should be chosen of the other type, i.e. if the first number chosen is odd, the second number should be even and vice versa. Now having chosen one number (say odd) in the beginning, 19 numbers are left, 10 of which are even, and hence the probability of choosing an even number is $\tfrac{10}{19}$. The same argument holds if the first number chosen is even.

The required probability $= \tfrac{10}{19}$.

[Note: The probability that the sum of the numbers chosen is even $= 1 - \tfrac{10}{19} = \tfrac{9}{19}$.

The student should find this probability independently.]

21. If three squares are chosen at random on a chess board, show that the chance that they should be in a diagonal line is $\tfrac{7}{744}$.

A chess board is a square divided into 64 equal squares by lines parallel to the sides of the outer square as shown in the figure. We can choose three squares in a diagonal line parallel to BD in the triangle ABD along the dotted lines, as shown in the figure. It can be seen easily that along the uppermost line there are only three squares and hence the selection can be made in ${}^{3}C_{3}$ ways, along the second line in ${}^{4}C_{3}$ ways and so on. Hence the number of ways in which three squares in a diagonal line can be chosen in the triangle ABD is

$${}^{3}C_{3} + {}^{4}C_{3} + {}^{5}C_{3} + {}^{6}C_{3} + {}^{7}C_{3} + {}^{8}C_{3}.$$

Similarly in the triangle BCD, the squares can be chosen parallel to BD in an equal number of ways.

Hence the total number of ways in which three squares can be chosen in a diagonal line parallel to BD is

$$2\,({}^{3}C_{3} + {}^{4}C_{3} + {}^{5}C_{3} + {}^{6}C_{3} + {}^{7}C_{3}) + {}^{8}C_{3},$$

since the line BD is common to both the triangles.

The same argument applies to squares parallel to AC and hence the total number of favourable ways

$$= 4\,\{{}^{3}C_{3} + {}^{4}C_{3} + {}^{5}C_{3} + {}^{6}C_{3} + {}^{7}C_{3}\} + 2 \cdot {}^{8}C_{3} = 392.$$

Also out of 64 squares, three can be chosen in ${}^{64}C_{3}$ ways.

Hence the required probability

$$= \frac{392}{{}^{64}C_{3}} = \frac{392 \times 1.2.3}{64.63.62} = \frac{7}{744}.$$
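
The count of 392 favourable selections can be checked by direct enumeration of all sets of three squares. The Python sketch below is one way of doing so; it labels each square by an assumed (row, column) pair, and the helper name is illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Squares labelled by (row, column), each running from 0 to 7.
squares = [(r, c) for r in range(8) for c in range(8)]


def on_one_diagonal(triple):
    """True when the three squares lie on a single diagonal line."""
    same_difference = len({r - c for r, c in triple}) == 1  # lines in one diagonal direction
    same_sum = len({r + c for r, c in triple}) == 1         # lines in the other direction
    return same_difference or same_sum


favourable = sum(1 for t in combinations(squares, 3) if on_one_diagonal(t))
probability = Fraction(favourable, comb(64, 3))
```

The enumeration runs over ${}^{64}C_{3} = 41664$ selections, which is small enough to examine exhaustively.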

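22. In each of a set of games it is 2 to 1 in favour of the winner of the previous game; what is the chance that the player who wins the first game shall win 3 at least of the next four games?

(B. A. Punjab ’56)

The favourable cases are

$$w, w, w;\quad w, w, l, w;\quad w, l, w, w;\quad l, w, w, w$$

where $w$ stands for a win and $l$ for losing the game.

The respective probabilities are:

for $w, w, w$: $(\tfrac{2}{3})^3 = \tfrac{8}{27}$,

for $w, w, l, w$: $(\tfrac{2}{3})^2 \cdot \tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{4}{81}$,

for $w, l, w, w$: $\tfrac{2}{3} \cdot \tfrac{1}{3} \cdot \tfrac{1}{3} \cdot \tfrac{2}{3} = \tfrac{4}{81}$,

for $l, w, w, w$: $\tfrac{1}{3} \cdot \tfrac{1}{3} \cdot \tfrac{2}{3} \cdot \tfrac{2}{3} = \tfrac{4}{81}$.

Since they are mutually exclusive events, the required probability is the sum of these.

The required probability $= \tfrac{8}{27} + \tfrac{4}{81} + \tfrac{4}{81} + \tfrac{4}{81} = \tfrac{4}{9}$.

Note: The first case includes both the cases, whether he loses or wins the fourth game, as both probabilities are included in the question.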

23. Nine cards are drawn at random from a set of cards. Each card is marked with one of the numbers 1, 0 or $-1$ and it is equally likely that any of the three numbers will be drawn. Find the chance that the sum of the numbers drawn is zero. (B. Sc. Agra ’60)

The number of favourable ways when the sum of the numbers on the cards is zero is the coefficient of $x^0$ in the expansion of $(x^{-1} + 1 + x)^9$.

Now

$$(x^{-1} + 1 + x)^9 = \frac{1}{x^9}(1 + x + x^2)^9 = \frac{1}{x^9}\left(\frac{1 - x^3}{1 - x}\right)^9 = \frac{1}{x^9}(1 - x^3)^9 (1 - x)^{-9}$$

$$= \frac{1}{x^9}(1 - 9x^3 + 36x^6 - 84x^9 + \ldots) \times (1 + 9x + 45x^2 + 165x^3 + \ldots + 3003x^6 + \ldots + 24310x^9 + \ldots).$$

Coefficient of $x^0 = 24310 - (9 \times 3003) + (165 \times 36) - 84 = 3139$.

Hence the required probability $= \dfrac{3139}{3^9}$.
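
The coefficient 3139 may be verified without the series expansion, by building up the number of ways of reaching each sum one card at a time. The following Python sketch does this; the function name is hypothetical.

```python
from fractions import Fraction


def ways_sum_zero(n):
    """Number of sequences of n marks from {-1, 0, 1} whose sum is zero."""
    counts = {0: 1}  # with no cards drawn, the sum 0 arises in one way
    for _ in range(n):
        new = {}
        for partial_sum, ways in counts.items():
            for mark in (-1, 0, 1):
                new[partial_sum + mark] = new.get(partial_sum + mark, 0) + ways
        counts = new
    return counts[0]


favourable = ways_sum_zero(9)
probability = Fraction(favourable, 3**9)

# The same number from the two series used in the solution above.
from_series = 24310 - 9 * 3003 + 165 * 36 - 84
```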

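24. If $p$ is the chance that an odd number of aces turn up when $n$ ordinary dice are thrown, show that $1 - 2p = (\tfrac{2}{3})^n$.

The probability of throwing an ace with an ordinary die $= \tfrac{1}{6}$. Hence the probabilities of throwing 1, 3, 5, ... aces in a throw of $n$ dice are given by the alternate terms of the binomial $(\tfrac{5}{6} + \tfrac{1}{6})^n$, so that

$$p = {}^{n}C_{1} (\tfrac{5}{6})^{n-1} (\tfrac{1}{6}) + {}^{n}C_{3} (\tfrac{5}{6})^{n-3} (\tfrac{1}{6})^3 + {}^{n}C_{5} (\tfrac{5}{6})^{n-5} (\tfrac{1}{6})^5 + \ldots$$

$$= (\tfrac{1}{6})^n \{{}^{n}C_{1} 5^{n-1} + {}^{n}C_{3} 5^{n-3} + {}^{n}C_{5} 5^{n-5} + \ldots\}$$

$$= (\tfrac{1}{6})^n \cdot \tfrac{1}{2} \{(5 + 1)^n - (5 - 1)^n\}$$

$$= \tfrac{1}{2} (\tfrac{1}{6})^n \{(6)^n - (4)^n\}$$

$$= \tfrac{1}{2} \{1 - (\tfrac{2}{3})^n\}$$

so that $2p = 1 - (\tfrac{2}{3})^n$

or $1 - 2p = (\tfrac{2}{3})^n$.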

25. What is the probability of getting 9 cards of the same suit in one hand at a game of bridge? (I. A. S. ’51)

The particular player can get 9 cards out of the thirteen of one suit in ${}^{13}C_{9}$ ways and 4 cards of the other suits in ${}^{39}C_{4}$ ways.

Since there are four suits in a set of cards, the number of ways in which he can get nine cards of the same suit

$$= {}^{13}C_{9} \times {}^{39}C_{4} \times 4.$$

Also the number of ways in which thirteen cards can be given to the player $= {}^{52}C_{13}$.

Hence the required probability

$$= \frac{{}^{13}C_{9} \times {}^{39}C_{4} \times 4}{{}^{52}C_{13}}.$$

26. At a deal for bridge, the player A has received two aces. Find the probability for each of the possible numbers of aces that may have been dealt to his partner.

A having received all his thirteen cards, his partner can have 13 cards out of the remaining thirty-nine.

In case his partner does not have any aces, the partner can have 13 cards out of 37 only, since he cannot have any card out of A’s cards or the remaining two aces.

Hence the probability that A’s partner does not have any aces

$$= \frac{{}^{37}C_{13}}{{}^{39}C_{13}} = \frac{(37)!\,(13)!\,(26)!}{(13)!\,(24)!\,(39)!} = \frac{25}{57}.$$

The probability that A’s partner has one ace

$$= \frac{{}^{37}C_{12} \times {}^{2}C_{1}}{{}^{39}C_{13}} = \frac{26}{57},$$

since he can have one ace out of the remaining two and twelve cards out of the remaining 37.

Similarly the probability that the partner has two aces

$$= \frac{{}^{37}C_{11} \times {}^{2}C_{2}}{{}^{39}C_{13}} = \frac{2}{19}.$$
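
The three probabilities of Example 26 must together account for every possibility, and so must add up to 1. The Python sketch below computes them in one pass; the function name is an assumption made for illustration.

```python
from fractions import Fraction
from math import comb


def partner_ace_distribution():
    """Chance that the partner holds k of the two remaining aces, for k = 0, 1, 2."""
    total = comb(39, 13)  # the partner's 13 cards come from the 39 cards not held by A
    return {
        k: Fraction(comb(2, k) * comb(37, 13 - k), total)  # k aces and 13 - k other cards
        for k in range(3)
    }


distribution = partner_ace_distribution()
```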

27. If $x$ be one of the first hundred numbers chosen at random, find the probability that $x + \dfrac{100}{x}$ is greater than 50.

$$x + \frac{100}{x} > 50$$

or

$$x^2 - 50x + 100 > 0$$

or

$$(x - 25)^2 > 525$$

or

$$|x - 25| > 22.9 \text{ nearly},$$

i.e. either $x > 47$ or $x < 3$.

Thus $x$ can take 55 values, 2 being less than three and 53 greater than 47, i.e. either 1 or 2 or any one from 48 to 100.

The required probability $= \dfrac{55}{100} = \dfrac{11}{20}$.
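
The count of 55 favourable values is readily confirmed by trying each of the hundred numbers in turn. In the Python sketch below the inequality is multiplied through by $x$ (which is positive), so that only whole-number arithmetic is used; the variable names are illustrative.

```python
from fractions import Fraction

# x + 100/x > 50 is equivalent to x*x + 100 > 50*x when x is positive.
favourable = [x for x in range(1, 101) if x * x + 100 > 50 * x]
probability = Fraction(len(favourable), 100)
```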







28. A bag contains $n$ counters marked $1, 2, 3, \ldots, n$. If two counters are drawn, show that the chance that the difference of the counters exceeds $m$ (less than $n - 1$) is

$$\frac{(n - m)(n - m - 1)}{n(n - 1)}.$$

The difference can be $1, 2, \ldots, n - 1$. Now the difference can come out to be $n - 1$ in only one way, when one counter is 1 and the other $n$; it can be $n - 2$ in two ways, when the counters are 2 and $n$ or 1 and $n - 1$. Similarly the difference will be $n - r$ in $r$ ways, the counters in this case being numbered

$$1,\ n - r + 1;$$

$$2,\ n - r + 2;$$

$$\ldots$$

$$r,\ n.$$

Hence the total number of ways in which the difference is $n - 1, n - 2, \ldots, m + 1$ is

$$1 + 2 + 3 + \ldots + (n - m - 1) = \frac{(n - m - 1)(n - m)}{2}.$$

The total number of ways in which two counters can be drawn from among $n$ is ${}^{n}C_{2}$.

Hence the required probability

$$= \frac{(n - m - 1)(n - m)}{2} \Big/ \frac{n(n - 1)}{2} = \frac{(n - m)(n - m - 1)}{n(n - 1)}.$$

29. Find the probability that at a deal for bridge, at least one of the four players will have thirteen cards of the same suit.

(Delhi M. A. ’59)

Let the four players be A, B, C, D and let $A$ also stand for the event ‘A possessing thirteen cards of the same suit’, $B$ the event ‘B possessing 13 cards of the same suit’ and so on. We have to find the probability that at least one of the events $A, B, C, D$ will occur. We denote it by $P(A + B + C + D)$.

From § 6.6, we have

$$P(A + B + C + D) = \sum P(A) - \sum P(AB) + \sum P(ABC) - P(ABCD).$$

Now $P(A) = P(B) = P(C) = P(D)$,

$P(AB) = P(BC) = P(CD) = P(DA) = P(AC) = P(BD)$,

$P(ABC) = P(ABD) = P(BCD) = P(ACD)$.

Hence

$$P(A + B + C + D) = 4P(A) - 6P(AB) + 4P(ABC) - P(ABCD). \qquad \ldots(1)$$

Now if A has got 13 cards of the same suit, there are 4 favourable ways, since the cards can be of any suit. Also A can get 13 cards out of 52 cards in ${}^{52}C_{13}$ ways.

Hence $P(A) = \dfrac{4}{{}^{52}C_{13}}$.

After A has received 13 cards of one suit, 39 cards of 3 suits are left and for B to have 13 cards of the same suit, there are 3 favourable ways out of ${}^{39}C_{13}$.

Hence $P(AB) = \dfrac{4}{{}^{52}C_{13}} \times \dfrac{3}{{}^{39}C_{13}}$.

Similarly $P(ABC) = \dfrac{4}{{}^{52}C_{13}} \times \dfrac{3}{{}^{39}C_{13}} \times \dfrac{2}{{}^{26}C_{13}}$

and $P(ABCD) = \dfrac{4}{{}^{52}C_{13}} \times \dfrac{3}{{}^{39}C_{13}} \times \dfrac{2}{{}^{26}C_{13}} \times \dfrac{1}{{}^{13}C_{13}}$.

Hence $P(A + B + C + D)$ can be obtained by substituting these values in the equation (1).

30. A five-figure number is formed by the digits 0, 1, 2, 3, 4 (without repetition). Find the probability that the number formed is divisible by 4. (Agra B. Sc. ’5->)

The five digits can be arranged in $5!$ ways, out of which $4!$ will begin with zero. Hence the total number of five-figure numbers formed $= (5)! - (4)! = 96$.

The numbers formed will be divisible by 4 if the number formed by the two digits on the extreme right is divisible by 4, i.e. it should be 04, 12, 20, 24, 32, 40.

The numbers ending in 04 $= 3! = 6$.

The numbers ending in 12 $= 3! - 2! = 4$.

The numbers ending in 20 $= 3! = 6$.

The numbers ending in 24 $= 3! - 2! = 4$.

The numbers ending in 32 $= 3! - 2! = 4$.

The numbers ending in 40 $= 3! = 6$.

(It may be noted that the number of those numbers which have zero in either of the two right-hand places, e.g. 04, 20 or 40, is 6 since the remaining three digits can be arranged in $3!$ ways; those which do not have a zero in the two right-hand places are $(3! - 2!)$, since we have to exclude the case of the numbers with zero on the extreme left.)

The total number of favourable ways $= 6 + 4 + 6 + 4 + 4 + 6 = 30$.

The required probability

$$= \frac{30}{5! - 4!} = \frac{30}{96} = \frac{5}{16}.$$
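
Since only 96 numbers are involved, the result of Example 30 can be checked by listing them all. The Python sketch below does this with the standard permutation routine; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Every arrangement of the digits 0, 1, 2, 3, 4 that does not begin with zero.
numbers = [int("".join(p)) for p in permutations("01234") if p[0] != "0"]
divisible = [n for n in numbers if n % 4 == 0]
probability = Fraction(len(divisible), len(numbers))
```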

31. What is the probability that at least one of the players in a bridge game will get a complete suit of cards? (M. A. Delhi ’59)

Since the events of getting a complete suit of cards by the several players are not mutually exclusive, and one, two, three or all the four may have a complete suit of cards, the required probability is

$$P(A_1 + A_2 + A_3 + A_4) = S_1 - S_2 + S_3 - S_4,$$

where $S_i$ denotes the probability that $i$ players each have a complete suit of cards, $i = 1, 2, 3, 4$.

The probability that one player has a complete suit of cards is

$$S_1 = \frac{{}^{4}C_{1} \times 4}{{}^{52}C_{13}},$$

since out of the four players, one player can be selected in ${}^{4}C_{1}$ ways; moreover the selected player can have cards of any of the four suits, which again can be selected in 4 ways. The total number of ways in which he or she can have 13 cards is ${}^{52}C_{13}$.

Similarly

$$S_2 = \frac{{}^{4}C_{2} \times 4 \times 3}{{}^{52}C_{13} \times {}^{39}C_{13}}.$$

In this case, the two players can be selected in ${}^{4}C_{2}$ ways, the first player can get cards of one of the four suits and the second one of the three remaining suits. After giving 13 cards to one selected player, the second player can get 13 cards out of the remaining 39 cards in ${}^{39}C_{13}$ ways.

$$S_3 = \frac{{}^{4}C_{3} \times 4 \times 3 \times 2}{{}^{52}C_{13} \times {}^{39}C_{13} \times {}^{26}C_{13}}$$

and

$$S_4 = \frac{{}^{4}C_{4} \times 4 \times 3 \times 2 \times 1}{{}^{52}C_{13} \times {}^{39}C_{13} \times {}^{26}C_{13} \times {}^{13}C_{13}}.$$

The result can be obtained by simplifying $S_1 - S_2 + S_3 - S_4$.


32. A and B throw with a pair of dice. A wins if he throws 6 before B throws 7, and B if he throws 7 before A throws 6. If A begins, show that his chance of winning is $\tfrac{30}{61}$.

(M. Sc. Agra ’45, B. A. Hons. Delhi ’56, ’60)

Six can be thrown with a pair of dice in the manner (1, 5), (2, 4), (3, 3), (4, 2) and (5, 1). Thus there are five ways of throwing a total of 6 with two dice and hence the probability is $\tfrac{5}{36}$, as the possible number of throws with two dice is $6 \times 6$, i.e. 36.

Similarly 7 can be thrown as (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1), i.e. in six ways; hence the probability of throwing a sum of 7 with two dice is $\tfrac{6}{36}$ or $\tfrac{1}{6}$.

The game starts with a throw by A and his probability of winning in the first throw is $\tfrac{5}{36}$ and of not winning is $\tfrac{31}{36}$. Now B gets a chance if A fails to throw 6 in the first throw and hence his chance of winning in his first throw is $\tfrac{31}{36} \times \tfrac{1}{6}$. Similarly A gets the second chance if both A and B fail in their first throws, the probability of it being $\tfrac{31}{36} \times \tfrac{5}{6}$, and his probability of winning at this chance is $\tfrac{31}{36} \times \tfrac{5}{6} \times \tfrac{5}{36}$, and B’s probability of winning at the next chance is $\tfrac{31}{36} \times \tfrac{5}{6} \times \tfrac{31}{36} \times \tfrac{1}{6}$ and so on.

It may be noticed that A can win only in the first, third, fifth, ... throws of the game while B can win in the second, fourth, sixth, ... throws. Their respective probabilities of winning are given in the columns below:

    A's probability                                  B's probability

    5/36                                             31/36 x 1/6
    31/36 x 5/6 x 5/36                               31/36 x 5/6 x 31/36 x 1/6
    31/36 x 5/6 x 31/36 x 5/6 x 5/36                 31/36 x 5/6 x 31/36 x 5/6 x 31/36 x 1/6
    ...                                              ...
    an infinite series                               an infinite series

The probability of A’s win is the sum of all the probabilities in the first column.

A’s chance

$$= \frac{5}{36} \left[1 + \left(\frac{31}{36} \times \frac{5}{6}\right) + \left(\frac{31}{36} \times \frac{5}{6}\right)^2 + \ldots\right]$$

$$= \frac{5}{36} \cdot \frac{1}{1 - \dfrac{155}{216}} = \frac{30}{61}.$$
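
The summation of the infinite series can be checked both in closed form and by adding a large number of its terms. The Python sketch below assumes only what the solution states (A succeeds with chance 5/36 at each of his throws and B with chance 1/6); the function name is illustrative.

```python
from fractions import Fraction


def first_player_chance(a, b):
    """Chance that the player throwing first wins, when he succeeds with chance a
    at each of his throws and his opponent with chance b."""
    both_fail = (1 - a) * (1 - b)  # a full round passes without a decision
    return a / (1 - both_fail)     # sum of a * both_fail**k for k = 0, 1, 2, ...


chance_A = first_player_chance(Fraction(5, 36), Fraction(1, 6))

# Partial sum of the first column of the table, taken to 200 terms.
ratio = Fraction(31, 36) * Fraction(5, 6)
partial_sum = sum(Fraction(5, 36) * ratio**k for k in range(200))
```

The partial sum falls short of the closed form only by the neglected tail of the series, which is negligible after 200 terms.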

33. A, B and C in order toss a coin. The first one to throw a head wins. What are their respective chances of winning assuming that the game may continue indefinitely? (I. A. S. ’55)

As in the previous question, the respective chances of A, B, C are given by

    A              B              C

    1/2            (1/2)^2        (1/2)^3
    (1/2)^4        (1/2)^5        (1/2)^6
    (1/2)^7        (1/2)^8        (1/2)^9
    ...            ...            ...

A’s chance

$$= \tfrac{1}{2} \left[1 + (\tfrac{1}{2})^3 + (\tfrac{1}{2})^6 + \ldots\right] = \frac{1}{2} \cdot \frac{1}{1 - (\tfrac{1}{2})^3} = \frac{4}{7}.$$

Similarly B’s chance $= \tfrac{2}{7}$,

C’s chance $= \tfrac{1}{7}$.

34. Four tickets marked 00, 01, 10, 11 respectively are placed in a bag. A ticket is drawn at random five times, being replaced each time. Find the probability that the sum of the numbers on the tickets thus drawn is 23. (M. Sc. Agra ’52)

The total number of favourable ways is the coefficient of $x^{23}$ in the expansion of $(x^0 + x^1 + x^{10} + x^{11})^5$, since this is the number of ways in which $x^{23}$ can be obtained by multiplying five factors chosen from $x^0, x^1, x^{10}, x^{11}$.

Now $x^0 + x + x^{10} + x^{11} = (1 + x)(1 + x^{10})$.

Hence

$$(1 + x + x^{10} + x^{11})^5 = (1 + x)^5 (1 + x^{10})^5$$

$$= (1 + 5x + 10x^2 + 10x^3 + 5x^4 + x^5)(1 + 5x^{10} + 10x^{20} + \ldots).$$

Coefficient of $x^{23} = 10 \times 10 = 100$.

The total number of ways in which the tickets can be drawn $= 4^5$.

The required probability $= \dfrac{100}{4^5}$.
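
The coefficient 100 may also be found by running through all $4^5 = 1024$ possible sequences of draws. The Python sketch below does so; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

tickets = (0, 1, 10, 11)  # the numbers marked 00, 01, 10, 11

# Count the ordered sequences of five draws (with replacement) that total 23.
favourable = sum(1 for draw in product(tickets, repeat=5) if sum(draw) == 23)
probability = Fraction(favourable, len(tickets) ** 5)
```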




35. A bag contains 6 white balls and 9 black balls. Two drawings of 4 balls each are made such that

(a) the balls are replaced before the second trial,

(b) the balls are not replaced before the second trial.

Find the probability that the first drawing will give 4 white and the second 4 black balls in each case. (I. A. S. ’45, B. Sc. Agra ’61)

(a) When the balls are replaced.

Four white balls can be drawn in ${}^{6}C_{4}$ ways while in the second draw four black balls can be drawn in ${}^{9}C_{4}$ ways. The possible number of drawings in each case is ${}^{15}C_{4}$.

The required probability

$$= \frac{{}^{6}C_{4}}{{}^{15}C_{4}} \times \frac{{}^{9}C_{4}}{{}^{15}C_{4}} = \frac{6}{5915}.$$

(b) When the balls are not replaced.

Four white balls in the first drawing can be drawn in ${}^{6}C_{4}$ ways while four black balls in the second draw can be drawn in ${}^{9}C_{4}$ ways. The total number of possible draws in the first case is ${}^{15}C_{4}$ and in the second ${}^{11}C_{4}$, since 11 balls are left in the bag at the time of the second draw.

The required probability

$$= \frac{{}^{6}C_{4}}{{}^{15}C_{4}} \times \frac{{}^{9}C_{4}}{{}^{11}C_{4}} = \frac{3}{715}.$$

36. In a purse there are 10 coins, all shillings except one which is a sovereign; in another are ten coins, all shillings. Nine coins are taken from the former purse and put into the latter and then nine coins are taken from the latter and put into the former. Find the chance that the sovereign is still in the first purse.

Two cases are possible: firstly, the sovereign does not go from the first purse at all and secondly, the sovereign goes to the second purse and comes back.

(i) The probability that the sovereign does not go to the second purse

$$= \frac{{}^{9}C_{9}}{{}^{10}C_{9}} \times 1 = \frac{1}{10}.$$

(ii) The probability that the sovereign goes to the second purse and comes back

$$= \frac{{}^{9}C_{8} \times {}^{1}C_{1}}{{}^{10}C_{9}} \times \frac{{}^{18}C_{8} \times {}^{1}C_{1}}{{}^{19}C_{9}} = \frac{81}{190},$$

the second term after the multiplication sign denoting the probability of the selection of 8 shillings out of 18 and 1 sovereign.

Since they are mutually exclusive events, the required probability $= \tfrac{1}{10} + \tfrac{81}{190} = \tfrac{10}{19}$.





37. A bag contains six balls of different colours and a ball is drawn from it. A speaks truth twice out of three times and B speaks truth 4 times out of 5. If both A and B say that a red ball has been drawn, find the probability of their joint statement being true. (B. Sc. Agra ’59)

Let the probability of a red ball being selected be $P_1$ and of its not being selected be $P_2$, so that

$$P_1 = \tfrac{1}{6}, \qquad P_2 = \tfrac{5}{6}.$$

The probability of both speaking truth $p_1 = \tfrac{2}{3} \times \tfrac{4}{5}$.

The probability of both speaking false $p_2 = \tfrac{1}{3} \times \tfrac{1}{5} \times \tfrac{1}{5} \times \tfrac{1}{5}$,

since the probability that both name the red ball while it is not drawn is $\tfrac{1}{5} \times \tfrac{1}{5}$, as the chance of one person naming one colour (red) out of the 5 colours not drawn is $\tfrac{1}{5}$.

The odds in favour of the two hypotheses, i.e. both making true or both making false statements, are $P_1 p_1$ to $P_2 p_2$,

i.e. $\tfrac{4}{45}$ to $\tfrac{1}{450}$,

i.e. 40 to 1.

Hence the probability of the statement being true $= \tfrac{40}{41}$.

38. If m things are distributed among a men and b women, show
that the chance that the number of things received by men is odd is

$\dfrac{1}{2} \cdot \dfrac{(b+a)^{m} - (b-a)^{m}}{(b+a)^{m}}$.

(B. Sc. Agra ’57; M. Sc. Agra ’53)

The probability of one thing going to a man is $\dfrac{a}{a+b}$ and that of
its going to a woman is $\dfrac{b}{a+b}$. The probabilities of 0, 1, 2, 3, ...
things going to men are the terms containing $p^{0}, p^{1}, p^{2}, \ldots$
in the expansion of $(p+q)^{m}$, where $p = \dfrac{a}{a+b}$ and $q = \dfrac{b}{a+b}$.

Thus the probability of r things out of m being received by men
is $^{m}C_{r}\, p^{r} q^{m-r}$. But men are to receive an odd number of
things.

Hence the required probability

$= \dfrac{1}{(a+b)^{m}} \{\, ^{m}C_{1}\, a b^{m-1} + {}^{m}C_{3}\, a^{3} b^{m-3} + {}^{m}C_{5}\, a^{5} b^{m-5} + \ldots \}$

$= \dfrac{1}{2} \cdot \dfrac{(b+a)^{m} - (b-a)^{m}}{(a+b)^{m}}$.

39. Let p be the probability that a man aged x years dies in
a year. Find the probability that out of n men, $A_1, A_2, \ldots, A_n$, each
aged x, $A_1$ will die and be the first to die.

(M. A. Punjab ’58, I. A. S. ’54)

The probability that one man dies is p and that he does not die is 1 - p.
Hence the probability that out of n persons none dies in that year
is $(1-p)^{n}$. Hence the probability that at least one man dies in
that year is $1 - (1-p)^{n}$.

Also the probability that out of n men, $A_1$ is the first to die
is $\dfrac{1}{n}$.

Since they are independent events, the required probability
is the product of the two $= \dfrac{1}{n}\{1 - (1-p)^{n}\}$.

40. Out of 3n consecutive integers, three are selected at
random. Find the chance that their sum is divisible by 3.

(M. Sc. Agra ’49, ’53)

Let the numbers be m, m+1, ..., m+3n-1. Now they can be
classified as

m, m+3, m+6, ..., m+3(n-1),
m+1, m+4, m+7, ..., m+3n-2,
m+2, m+5, m+8, ..., m+3n-1.

The sum of the numbers shall be divisible by 3 if either all
the numbers are from the same row or all the three numbers are
from different rows.

The number of ways in which the three numbers are from the same
row is $3 \cdot {}^{n}C_{3}$. Also the number of ways in which the numbers
are from different rows is $n \times n \times n = n^{3}$, as each number can be
selected in n ways.

The total number of selections of 3 numbers from among 3n
numbers is $^{3n}C_{3}$.

The required probability

$= \dfrac{n^{3} + 3 \cdot {}^{n}C_{3}}{^{3n}C_{3}} = \dfrac{3n^{2} - 3n + 2}{(3n-1)(3n-2)}$.
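The counting argument above can be checked by direct enumeration for small n. The following is only an illustrative sketch; the function names are hypothetical and not part of the text.

```python
from fractions import Fraction
from itertools import combinations

def prob_sum_divisible_by_3(n, start=1):
    """Exact chance that 3 of the 3n consecutive integers beginning at
    `start` have a sum divisible by 3, found by listing every selection."""
    numbers = range(start, start + 3 * n)
    triples = list(combinations(numbers, 3))
    favourable = sum(1 for t in triples if sum(t) % 3 == 0)
    return Fraction(favourable, len(triples))

def closed_form_divisible_by_3(n):
    """The value (3n^2 - 3n + 2) / ((3n - 1)(3n - 2)) obtained above."""
    return Fraction(3 * n * n - 3 * n + 2, (3 * n - 1) * (3 * n - 2))
```

For n = 2 both functions give 2/5, and the result does not depend on the starting integer m.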





41. A bag contains n balls, one of which is white. The
probabilities that A and B speak the truth are p and p' respectively.
A ball is drawn from the bag and A and B both assert that it is
white. What are the odds in favour of the statement being true?

The event is the agreement of both A and B in a statement.
There are two possibilities: (i) it is true, (ii) it is false.

In case (i) a white ball has been drawn and both A and B
assert correctly that it is white.

In case (ii) a non-white ball is drawn and both A and B pick
up the white colour out of the remaining n-1 colours to assert
falsely that a white ball is drawn.

Now the probability of a white ball being drawn is $\dfrac{1}{n}$ and that of a
non-white ball is $\dfrac{n-1}{n}$.

The chance that a white ball is drawn and both A and B
rightly assert a white ball $= \dfrac{1}{n}\, p p'$.

The chance that a non-white ball is drawn and both A and B
wrongly select the white colour for assertion from among the n-1
remaining colours

$= \dfrac{n-1}{n} \cdot \dfrac{1-p}{n-1} \cdot \dfrac{1-p'}{n-1} = \dfrac{(1-p)(1-p')}{n(n-1)}$.

Hence the odds in favour of the statement being true

$= \dfrac{p p'}{n} : \dfrac{(1-p)(1-p')}{n(n-1)}$

$= (n-1)\, p p' : (1-p)(1-p')$.

42. In a bolt factory machines A, B, C manufacture 25, 35
and 40 per cent of the total. Out of their output 5, 4 and 2 per cent
are defective bolts. A bolt is drawn from the produce and is found
to be defective. What are the probabilities that it was manufactured
by A, B and C? (M. A. Madras ’60, M. A. Delhi ’59)

The chance that A produces a defective bolt $= \dfrac{5}{100}$; similarly the
chance that B produces a defective bolt is $\dfrac{4}{100}$ and for C it is $\dfrac{2}{100}$.

Now out of 100 bolts manufactured in the factory, 25 are
produced by A. Taking the production in the factory to be 100,

the number of defective bolts produced by A $= \dfrac{5}{100} \times 25 = \dfrac{5}{4}$,

the number of defective bolts produced by B $= \dfrac{4}{100} \times 35 = \dfrac{7}{5}$,

and the number of defective bolts produced by C $= \dfrac{2}{100} \times 40 = \dfrac{4}{5}$.

∴ the probability that the defective bolt is produced by A

$= \dfrac{\frac{5}{4}}{\frac{5}{4} + \frac{7}{5} + \frac{4}{5}} = \dfrac{25}{69}$.

Similarly the probability of its being produced by B

$= \dfrac{\frac{7}{5}}{\frac{5}{4} + \frac{7}{5} + \frac{4}{5}} = \dfrac{140}{345} = \dfrac{28}{69}$,

and that of its being manufactured by C $= \dfrac{80}{345} = \dfrac{16}{69}$.
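The same inverse-probability computation can be written as a short routine. This is a minimal sketch in exact arithmetic; the dictionary and function names are assumptions made for illustration only.

```python
from fractions import Fraction

# Share of total output and defective rate for each machine (from the problem).
share = {"A": Fraction(25, 100), "B": Fraction(35, 100), "C": Fraction(40, 100)}
defect_rate = {"A": Fraction(5, 100), "B": Fraction(4, 100), "C": Fraction(2, 100)}

def posterior(share, defect_rate):
    """Probability that a defective bolt came from each machine."""
    # Chance that a bolt is made by machine k AND is defective.
    joint = {k: share[k] * defect_rate[k] for k in share}
    total = sum(joint.values())  # overall chance of a defective bolt
    return {k: joint[k] / total for k in joint}

bolt_posterior = posterior(share, defect_rate)
```

Here `bolt_posterior` holds 25/69, 28/69 and 16/69 for A, B and C, which add up to 1.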

43. The first twelve letters of the alphabet are written at
random. Find the chance that there are exactly four letters between
A and D. (Agra B. Sc. ’56)

There are four possibilities.

(i) Either of the letters A or D is at the beginning or the end
of the word.

(ii) Either A or D is the second or the eleventh letter of the
word formed.

(iii) Either A or D is the third or the tenth letter of the word
formed.

(iv) One of the letters A and D is the fourth and the
other is the ninth letter of the word formed.

Each of the first three cases has two alternatives and the
last one has one type of arrangement, so that the pair of places
occupied by A and D can be chosen in 7 ways. Also, having determined
the positions of A and D, there are 10! arrangements of the rest
of the letters and there are two arrangements between A and D.
Hence there are in all 2 x 10! x 7 arrangements possible satisfying
the given condition.

The total number of ways in which 12 letters can be arranged
= 12!.

Hence the required probability $= \dfrac{2 \times 10! \times 7}{12!} = \dfrac{7}{66}$.




44. A coin is tossed (m+n) times (m > n); show that the
probability of at least m consecutive heads is

$\dfrac{n+2}{2^{m+1}}$.

(Benaras M. A. ’45, Ind. Statistical Inst. ’48)

If the sequence of m consecutive heads starts from the
beginning, i.e. the first m throws are heads (HHH... m times), the
other throws may be ‘head’ or ‘tail’, since we are considering the
cases of at least m consecutive heads, and the probability of this
event $= \dfrac{1}{2^{m}}$.

Similarly if the first is tail and the sequence of ‘heads’ starts
from the second throw, the probability of at least m consecutive
heads $= \dfrac{1}{2} \times \dfrac{1}{2^{m}} = \dfrac{1}{2^{m+1}}$.

If the sequence starts from the (r+1)th throw, r > 1, the first r-1
throws may be ‘heads’ or ‘tails’ (since m > n, they are too few to
contain m consecutive heads), the rth throw must be tail and
the next m throws must be heads. The probability in this case is

$\dfrac{1}{2} \times \dfrac{1}{2^{m}} = \dfrac{1}{2^{m+1}}$.

Hence the required probability

$= \dfrac{1}{2^{m}} + \left( \dfrac{1}{2^{m+1}} + \dfrac{1}{2^{m+1}} + \ldots \text{ n times} \right)$

$= \dfrac{1}{2^{m}} + \dfrac{n}{2^{m+1}}$

$= \dfrac{2+n}{2^{m+1}}$.
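For small m and n the result can be confirmed by listing all $2^{m+n}$ sequences of tosses. The sketch below is illustrative only; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def prob_run_of_heads(m, n):
    """Exact chance of at least m consecutive heads in m + n tosses of a
    fair coin, found by enumerating every possible sequence."""
    total_tosses = m + n
    favourable = 0
    for tosses in product("HT", repeat=total_tosses):
        # A run of at least m heads exists iff "H" * m occurs as a substring.
        if "H" * m in "".join(tosses):
            favourable += 1
    return Fraction(favourable, 2 ** total_tosses)
```

With m = 3 and n = 2, for instance, 8 of the 32 sequences are favourable, in agreement with $(n+2)/2^{m+1} = 1/4$.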

45. Out of (2n+1) tickets consecutively numbered, three are
drawn at random. Find the chance that the numbers on them are
in A. P. (Agra M. Sc. ’50, Agra B. Sc. ’59)

If the lowest number selected is 1, the groupings may be
1, 2, 3; 1, 3, 5; 1, 4, 7; ...; 1, n+1, 2n+1,
and they are n in number.

If the lowest number selected is 2, the possible groupings are
2, 3, 4; 2, 4, 6; ...; 2, n+1, 2n,
the number being n-1.

If the lowest number selected is 3, the groupings are
3, 4, 5; 3, 5, 7; ...; 3, n+2, 2n+1,
the number of groups this time being n-1.

Similarly it can be seen that if the lowest numbers selected
are 1, 2, 3, 4, 5, ..., 2n-2, 2n-1, the numbers of selections
respectively are n, n-1, n-1, n-2, n-2, ..., 1, 1.

The favourable ways for 2, 3 being the lowest numbers shall
be equal, and similarly for 4, 5 and so on.

The total number of favourable ways

$= 2\{1 + 2 + \ldots + (n-1)\} + n = n^{2}$.

Hence the required probability $= \dfrac{n^{2}}{^{2n+1}C_{3}} = \dfrac{3n}{4n^{2} - 1}$.
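A direct count over all selections confirms both the number of favourable ways, $n^{2}$, and the final probability. This is a small sketch under the assumption that the tickets are numbered 1 to 2n+1; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def count_ap_triples(n):
    """Number of 3-ticket selections from 1..2n+1 that are in A.P."""
    tickets = range(1, 2 * n + 2)
    # combinations() yields a < b < c, so A.P. means equal gaps.
    return sum(1 for a, b, c in combinations(tickets, 3) if b - a == c - b)

def prob_ap(n):
    """Exact chance that three tickets drawn from 1..2n+1 are in A.P."""
    total = len(list(combinations(range(1, 2 * n + 2), 3)))
    return Fraction(count_ap_triples(n), total)
```

For n = 2 there are 4 favourable selections out of 10, giving 2/5.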

46. Eight mice are selected at random from a large number
and then divided into two groups of four each: group A and group B.
Each mouse in group A is given a dose ‘a’ of a certain poison which
is expected to kill one in four. Each mouse in group B is given a
dose ‘b’ of another poison which is expected to kill one in two. Show
that nevertheless there may be fewer deaths in group B than in
group A and find the probability of this happening.

(I. A. S. ’53, Alld. M. Com. ’54)

There can be fewer deaths in group B than in group A in the
following ways:

          Deaths in group B     Deaths in group A
(i)              0                 1, 2, 3, 4
(ii)             1                 2, 3, 4
(iii)            2                 3, 4
(iv)             3                 4

For A, $p = \frac{1}{4}$, $q = \frac{3}{4}$, if p stands for death and q for survival.

For B, $p = \frac{1}{2}$, $q = \frac{1}{2}$.

For group A, the probabilities of 4, 3, 2, 1, 0 deaths are given by

$\left(\dfrac{1}{4} + \dfrac{3}{4}\right)^{4} = \dfrac{1}{256} + \dfrac{12}{256} + \dfrac{54}{256} + \dfrac{108}{256} + \dfrac{81}{256}$.

For group B, the probabilities of 4, 3, 2, 1, 0 deaths are

$\left(\dfrac{1}{2} + \dfrac{1}{2}\right)^{4} = \dfrac{1}{16} + \dfrac{4}{16} + \dfrac{6}{16} + \dfrac{4}{16} + \dfrac{1}{16}$.

The probability of (i), i.e. 0 deaths in group B and 1, 2, 3 or 4
deaths in A,

$= \dfrac{1}{16}\left(\dfrac{1}{256} + \dfrac{12}{256} + \dfrac{54}{256} + \dfrac{108}{256}\right) = \dfrac{175}{4096}$.

The probability of (ii) is

$\dfrac{4}{16}\left(\dfrac{1}{256} + \dfrac{12}{256} + \dfrac{54}{256}\right) = \dfrac{268}{4096}$.

The probability of (iii) is

$\dfrac{6}{16}\left(\dfrac{1}{256} + \dfrac{12}{256}\right) = \dfrac{78}{4096}$.

The probability of (iv) is

$\dfrac{4}{16} \times \dfrac{1}{256} = \dfrac{4}{4096}$.

The required probability

$= \dfrac{175}{4096} + \dfrac{268}{4096} + \dfrac{78}{4096} + \dfrac{4}{4096} = \dfrac{525}{4096}$.

47. A speaks the truth in 75% of the cases, and B in 80% of the
cases. In what percentage of cases are they likely to contradict
each other in stating the same fact? (Punjab B. A. ’58)

They will contradict each other if one speaks the truth and
the other does not.

The probability that A speaks the truth and B tells a lie
$= \dfrac{3}{4} \times \dfrac{1}{5} = \dfrac{3}{20}$.

Similarly the probability that B speaks the truth and A does not
$= \dfrac{4}{5} \times \dfrac{1}{4} = \dfrac{4}{20}$.

Hence the required probability
$= \dfrac{3}{20} + \dfrac{4}{20} = \dfrac{7}{20}$, i.e. they contradict each other in 35% of the cases.

48. A pack of cards is counted, face downwards , and it is found 
that one card is missing. Two cards are drawn and are found to 
be spades . What are the odds against the missing card being a 
spade ? 

There are two possibilities : 

(i) The missing card is a spade.

(ii) The missing card is not a spade. 




in random order, how will the probability of correctly judging with
every cup on the null hypothesis be altered?

Which of the two designs would you prefer and why?

(I. A. S. ’49)

(i) When all the cups are presented to the lady in random
order:

The number of permutations of 12 things, 6 being of one
kind and 6 of the other, is $\dfrac{12!}{6!\,6!}$, and there is only one way in
which she can judge the cups correctly.

Hence the probability $= \dfrac{6!\,6!}{12!} = \dfrac{1}{924}$.

(ii) When the cups are presented to her in pairs, the
probability of judging each pair correctly is $\frac{1}{2}$. Since all the cases
are independent, the probability of her judging correctly all the
6 pairs $= \left(\dfrac{1}{2}\right)^{6} = \dfrac{1}{64}$.

The first method is preferable since the probability of her
judging correctly on the null hypothesis (that she has no power of
discrimination between the cups) is much less than in the second case.

52. Cards are dealt one by one from a well-shuffled pack until
an ace appears. Show that the probability that exactly n cards
are dealt before the first ace appears is

$\dfrac{4\,(51-n)(50-n)(49-n)}{52 \cdot 51 \cdot 50 \cdot 49}$.

(Delhi Hons. ’55, I. C. A. R. ’50)

The chance of not drawing an ace in the first draw is $\frac{48}{52}$, of not
drawing one in the second draw is $\frac{47}{51}$, and so on, so that the chance of
not drawing an ace in the nth draw is

$\dfrac{48 - (n-1)}{52 - (n-1)} = \dfrac{49-n}{52-(n-1)}$,

and the probability of getting an ace in the (n+1)th draw is $\dfrac{4}{52-n}$.

Hence the required probability

$= \dfrac{48}{52} \cdot \dfrac{47}{51} \cdot \dfrac{46}{50} \cdots \dfrac{52-n}{52-(n-4)} \cdot \dfrac{51-n}{52-(n-3)} \cdot \dfrac{50-n}{52-(n-2)} \cdot \dfrac{49-n}{52-(n-1)} \cdot \dfrac{4}{52-n}$

$= \dfrac{4\,(51-n)(50-n)(49-n)}{52 \cdot 51 \cdot 50 \cdot 49}$.

Note. Every numerator in the product is the same as the
denominator of the fourth fraction after it, so that these factors cancel.
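The cancellation described in the Note can be verified numerically by multiplying the successive fractions and comparing with the closed form. The following is a small sketch; the function names are hypothetical.

```python
from fractions import Fraction

def prob_first_ace_after(n):
    """Chance that the first n cards dealt are not aces and card n + 1 is an ace."""
    prob = Fraction(1)
    for i in range(n):
        prob *= Fraction(48 - i, 52 - i)  # (i + 1)th card is not an ace
    return prob * Fraction(4, 52 - n)     # next card is one of the 4 aces

def closed_form_first_ace(n):
    """The expression 4(51 - n)(50 - n)(49 - n) / (52 * 51 * 50 * 49)."""
    return Fraction(4 * (51 - n) * (50 - n) * (49 - n), 52 * 51 * 50 * 49)
```

The two agree for every n from 0 to 48, and the probabilities over that range add up to 1, as they must since an ace is certain to appear.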

53. A pack of 52 cards is dealt to four players as in a game
of bridge. One of the players did not get an ace in three
consecutive games. Has he reason to complain of ill luck? (Madras ’55)

The probability of getting no ace by a player is $^{48}C_{13} / {}^{52}C_{13}$,
since he can get 13 cards out of 48 (the four aces excluded) in $^{48}C_{13}$
ways and the total number of ways in which the player can get
13 cards out of 52 is $^{52}C_{13}$.

The probability of not getting an ace in three games is
$\left( {}^{48}C_{13} / {}^{52}C_{13} \right)^{3}$, since these are independent events.

The required probability $= \left( \dfrac{39 \times 38 \times 37 \times 36}{52 \times 51 \times 50 \times 49} \right)^{3}$

$= (0.304)^{3}$

$= 0.028$ nearly.

Hence he has a cause of complaint, since a rare event of the
small probability 0.028 has happened.

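54. The sum of two positive quantities is equal to 2n. Find
the chance that the product of the two quantities is not less than $\frac{3}{4}$
times their greatest product.

The sum of the quantities is 2n and their product is maximum
when they are equal, i.e. the greatest product is $n^{2}$.

If the quantities are x and 2n - x, we require

$x(2n - x) \ge \tfrac{3}{4} n^{2}$,
i.e. $(2x - 3n)(2x - n) \le 0$,

so that x must lie between $\tfrac{1}{2}n$ and $\tfrac{3}{2}n$, and so x can take
values over a range of length $\left(\tfrac{3}{2}n - \tfrac{1}{2}n\right)$ to satisfy the given
condition, out of the whole range 2n.

Hence the required probability $= \dfrac{\tfrac{3}{2}n - \tfrac{1}{2}n}{2n} = \dfrac{1}{2}$.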

55. A and B play a match to be decided as soon as either has
won two games. The chance of either winning a game is $\frac{1}{20}$ and of
its being drawn $\frac{9}{10}$. What is the chance that the match is finished
in 10 or less games?

If the match is not finished in 10 games, one of the following
events occurs:

(i) All the games are drawn.

(ii) A and B each win one game and the remaining 8 are drawn.

(iii) A or B wins one game and the remaining nine are drawn.

We find out the corresponding probabilities:

(i) $\left(\dfrac{9}{10}\right)^{10}$.

(ii) $^{10}C_{2} \cdot 2 \left(\dfrac{1}{20}\right)^{2} \left(\dfrac{9}{10}\right)^{8}$,
since the winning chances can be selected in $^{10}C_{2}$ ways and they
can be permuted in two ways.

(iii) $2 \cdot {}^{10}C_{1} \left(\dfrac{9}{10}\right)^{9} \left(\dfrac{1}{20}\right)$,
since the winning chance can be selected in $^{10}C_{1}$ ways and either
A or B wins.

Since these are mutually exclusive events, the probability of
the match not being finished in 10 games is their sum
$= \dfrac{9^{8} \times 387}{2 \times 10^{10}}$.

The probability that the match is finished in 10 or less number
of games $= 1 - \dfrac{9^{8} \times 387}{2 \times 10^{10}} = 0.17$ nearly.
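The three mutually exclusive cases can be added in exact arithmetic to confirm the fraction and its decimal value. This is only a sketch; the constant and function names are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

WIN = Fraction(1, 20)   # chance that a given player wins a single game
DRAW = Fraction(9, 10)  # chance that a single game is drawn

def prob_unfinished(games=10):
    """Chance that neither player has won two games after `games` games."""
    all_drawn = DRAW ** games
    # One win each: choose the two decisive games, then who wins which.
    one_each = comb(games, 2) * 2 * WIN ** 2 * DRAW ** (games - 2)
    # A single decisive game, won by either A or B.
    single_win = 2 * comb(games, 1) * WIN * DRAW ** (games - 1)
    return all_drawn + one_each + single_win

prob_finished = 1 - prob_unfinished()
```

The value of `prob_finished` is about 0.167, which rounds to the 0.17 quoted above.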

56. In the circuit in the adjoining figure,
what is the probability that the bulb will be lit
(i.e. the circuit closed), given that it is equally
likely for any of the switches A, B, C, D to
be open or closed?

The bulb will be lit if both the switches A and B are closed
or if either switch C or D is closed. The desired probability is

P (AB or C or D) = P (AB) + P (C) + P (D) - P (ABC)
- P (ABD) - P (CD) + P (ABCD).

Since $P(A) = P(B) = P(C) = P(D) = \frac{1}{2}$,

$P(AB) = P(CD) = \frac{1}{4}$,

$P(ABC) = P(ABD) = \frac{1}{8}$,

$P(ABCD) = \frac{1}{16}$.

Hence the required probability
$= \dfrac{1}{4} + \dfrac{1}{2} + \dfrac{1}{2} - \dfrac{1}{8} - \dfrac{1}{8} - \dfrac{1}{4} + \dfrac{1}{16} = \dfrac{13}{16}$.

57. In the previous example, given that the bulb is lit,
calculate the probability that both switches A and B are closed.

The probability of both switches A and B being closed is $\frac{1}{4}$
and that of the bulb being lit $= \frac{13}{16}$.

Hence the required probability $= \dfrac{\frac{1}{4}}{\frac{13}{16}} = \dfrac{4}{13}$.
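Since each of the 16 switch settings is equally likely, both results can be checked by listing them. The sketch below assumes, as stated in the solution of example 56, that the bulb is lit when A and B are both closed or when C or D is closed; the names used are illustrative.

```python
from fractions import Fraction
from itertools import product

# Every setting of the switches A, B, C, D; True means "closed".
states = list(product([False, True], repeat=4))

def lit(a, b, c, d):
    """Assumed circuit condition: (A and B) or C or D."""
    return (a and b) or c or d

p_lit = Fraction(sum(1 for s in states if lit(*s)), len(states))
# A and B both closed always lights the bulb, so this is also P(AB and lit).
p_ab_closed = Fraction(sum(1 for a, b, c, d in states if a and b), len(states))
p_ab_given_lit = p_ab_closed / p_lit
```

The enumeration gives `p_lit` = 13/16 and `p_ab_given_lit` = 4/13.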


58. n letters, to each of which corresponds an envelope, are
placed in the envelopes at random. What is the probability that no
letter is placed in the right envelope?

Suppose $u_n$ is the number of ways in which all the letters
are put in the wrong envelopes. Consider any particular letter
and suppose it is put in the envelope of any other letter and vice
versa. Clearly this can happen in n-1 ways, since this exchange
of envelopes can take place with any of the remaining n-1
letters, so that the number of such possibilities is $(n-1)\,u_{n-2}$, as the
other n-2 letters can go wrong in $u_{n-2}$ ways. But if the letter occupies the
envelope of any other letter and not vice versa, then it can happen
in $(n-1)\,u_{n-1}$ ways, since this letter can occupy any of the n-1
envelopes (excluding its own) and the other n-1 letters can go
wrong in $u_{n-1}$ ways. Hence we have

$u_n = (n-1)(u_{n-2} + u_{n-1})$

or $u_n - n u_{n-1} = -\{u_{n-1} - (n-1)\,u_{n-2}\}$,

which is a difference equation. Putting $u_n - n u_{n-1} = v_n$, we have

$v_n = -v_{n-1} = (-1)^{n-2} v_2$

or $u_n - n u_{n-1} = (-1)^{n-2} \{u_2 - 2u_1\}$.

But $u_1 = 0$, since one single letter is bound to go to its own
envelope, and $u_2 = 1$.

∴ $u_n - n u_{n-1} = (-1)^{n-2} = (-1)^{n}$,

so that

$u_2 - 2u_1 = (-1)^{2}$ or $\dfrac{u_2}{2!} - \dfrac{u_1}{1!} = \dfrac{(-1)^{2}}{2!}$,

$u_3 - 3u_2 = (-1)^{3}$ or $\dfrac{u_3}{3!} - \dfrac{u_2}{2!} = \dfrac{(-1)^{3}}{3!}$,

$u_4 - 4u_3 = (-1)^{4}$ or $\dfrac{u_4}{4!} - \dfrac{u_3}{3!} = \dfrac{(-1)^{4}}{4!}$,

...

$u_n - n u_{n-1} = (-1)^{n}$ or $\dfrac{u_n}{n!} - \dfrac{u_{n-1}}{(n-1)!} = \dfrac{(-1)^{n}}{n!}$.

Adding, we get $u_n = n! \left\{ \dfrac{1}{2!} - \dfrac{1}{3!} + \dfrac{1}{4!} - \ldots + \dfrac{(-1)^{n}}{n!} \right\}$, since $u_1 = 0$.

But the total number of ways in which n letters can be placed
in n envelopes is n!.

Hence the required probability $= \dfrac{u_n}{n!}$

$= \dfrac{1}{2!} - \dfrac{1}{3!} + \dfrac{1}{4!} - \dfrac{1}{5!} + \ldots + \dfrac{(-1)^{n}}{n!}$,

or the sum of the first (n+1) terms of the expansion of $e^{-1}$.
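The difference equation and the final series can be checked against a direct count of the arrangements for small n. This is a minimal sketch; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def wrong_ways(n):
    """u_n from u_n = (n - 1)(u_{n-1} + u_{n-2}), with u_1 = 0 and u_2 = 1."""
    if n == 1:
        return 0
    prev2, prev1 = 0, 1  # u_1, u_2
    for k in range(3, n + 1):
        prev2, prev1 = prev1, (k - 1) * (prev1 + prev2)
    return prev1

def wrong_ways_by_counting(n):
    """Count arrangements in which no letter is in its own envelope."""
    return sum(
        1 for perm in permutations(range(n))
        if all(perm[i] != i for i in range(n))
    )

def series_probability(n):
    """1/2! - 1/3! + ... + (-1)^n / n!"""
    return sum((Fraction((-1) ** k, factorial(k)) for k in range(2, n + 1)), Fraction(0))
```

For n = 4, for example, both counts give 9 ways, so the probability is 9/24 = 3/8.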






59. If two integers A and B, B > A, are selected at random,
what is the probability that they have no prime number as a common
divisor?

If A is divided by a prime p, the remainders can be 0, 1, 2, ..., p-1,
and hence the probability that the remainder is 0, i.e. A is
divisible by p, is $\dfrac{1}{p}$.

The probability that B is divisible by p is again $\dfrac{1}{p}$ and hence
the probability that both A and B are divisible by p is $\dfrac{1}{p^{2}}$.

∴ the probability that A and B do not have p as a common
factor $= 1 - \dfrac{1}{p^{2}}$.

Hence the probability that no prime number is a common
factor of A and B is

$P = p_2 \times p_3 \times p_5 \times p_7 \times \ldots$,

where $p_n$ is the probability that n is not a common factor of A
and B, n = 2, 3, 5, 7, ...

Hence $P = \left(1 - \dfrac{1}{2^{2}}\right)\left(1 - \dfrac{1}{3^{2}}\right)\left(1 - \dfrac{1}{5^{2}}\right) \ldots$ to $\infty$

$= \dfrac{6}{\pi^{2}}$.



60. A and B have equal chances of winning a single game.
A wants n games and B, n+1 games to win a match. Show that the
odds in favour of A are 1+P to 1-P, where $P = \dfrac{(2n)!}{n!\, n!\, 2^{2n}}$.

A has to play at the most 2n games to win the match, the
extreme case being that in which he wins the last game and n-1 of
the other games. Out of the 2n games, A must win at least n
games to win the match. If A's chance of winning is denoted
by p and that of B by q, where $p = q = \frac{1}{2}$, the probabilities of
0, 1, 2, ... wins by A are given by the terms of the expansion

$(q+p)^{2n} = q^{2n} + {}^{2n}C_{1}\, q^{2n-1} p + \ldots + {}^{2n}C_{r}\, q^{2n-r} p^{r} + \ldots + p^{2n}$.

The probability that A wins at least n games is

$^{2n}C_{n}\, q^{n} p^{n} + {}^{2n}C_{n+1}\, q^{n-1} p^{n+1} + \ldots + {}^{2n}C_{2n}\, p^{2n}$. ... (1)

Now $^{2n}C_{0} + {}^{2n}C_{1} + \ldots + {}^{2n}C_{n-1} + {}^{2n}C_{n} + {}^{2n}C_{n+1} + {}^{2n}C_{n+2} + \ldots + {}^{2n}C_{2n} = 2^{2n}$

or $^{2n}C_{n} + 2\{{}^{2n}C_{n+1} + {}^{2n}C_{n+2} + \ldots + {}^{2n}C_{2n}\} = 2^{2n}$,

∴ $^{2n}C_{n+1} + {}^{2n}C_{n+2} + \ldots + {}^{2n}C_{2n} = \frac{1}{2}\{2^{2n} - {}^{2n}C_{n}\}$.

Substituting this value in (1), the probability of A's win

$= \dfrac{1}{2^{2n}} \left[ {}^{2n}C_{n} + \tfrac{1}{2}\{2^{2n} - {}^{2n}C_{n}\} \right] = \dfrac{1}{2} + \dfrac{1}{2} \cdot \dfrac{1}{2^{2n}}\, {}^{2n}C_{n} = \tfrac{1}{2}(1 + P)$.

The probability of A losing $= 1 - \tfrac{1}{2}(1 + P) = \tfrac{1}{2}(1 - P)$.

Hence the odds in A's favour are 1+P to 1-P.
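The device of playing out all 2n games can be compared with a game-by-game computation, in which the match ends as soon as either player has the games he needs. The sketch below uses hypothetical function names.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def prob_a_wins_match(a_needs, b_needs):
    """Chance that A wins when A still needs a_needs games and B needs
    b_needs games, each game being won by either with chance 1/2."""
    if a_needs == 0:
        return Fraction(1)
    if b_needs == 0:
        return Fraction(0)
    return (prob_a_wins_match(a_needs - 1, b_needs)
            + prob_a_wins_match(a_needs, b_needs - 1)) / 2

def big_p(n):
    """P = (2n)! / (n! n! 2^(2n))."""
    return Fraction(comb(2 * n, n), 2 ** (2 * n))
```

For n = 1, A wins with chance 3/4 and P = 1/2, so the odds are 3/2 to 1/2, i.e. 3 to 1.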

61. If n biscuits are distributed at random among N beggars,
what is the chance that a particular beggar receives r (< n)
biscuits? (Punjab B. A. ’53, ’54)

The particular beggar getting r biscuits, the remaining n-r biscuits
can be distributed among N-1 beggars in $(N-1)^{n-r}$ ways, since
each biscuit can be distributed in (N-1) ways. Also the r
biscuits can be selected out of n in $^{n}C_{r}$ ways. Since they are
compound events, the required probability

$= \dfrac{^{n}C_{r}\,(N-1)^{n-r}}{N^{n}}$,

$N^{n}$ being the number of ways in which n biscuits can be
distributed among N beggars.

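62. A player tosses a coin and is to score one point for every
head turned up and two for every tail. He is to play on until
his score reaches or passes n. If $p_n$ is the chance of attaining
exactly n, show that $p_n = \frac{1}{2}(p_{n-1} + p_{n-2})$ and hence find the value
of $p_n$. (Punjab B. A. ’56)

There are two ways of attaining n points: either he throws
a head when his score is n-1, or a tail when his score is n-2.
These are mutually exclusive events. We have

$p_n = \tfrac{1}{2}\, p_{n-1} + \tfrac{1}{2}\, p_{n-2} = \tfrac{1}{2}(p_{n-1} + p_{n-2})$.

Rearranging the terms,

$p_n + \tfrac{1}{2} p_{n-1} = p_{n-1} + \tfrac{1}{2} p_{n-2} = p_{n-2} + \tfrac{1}{2} p_{n-3} = \ldots = p_2 + \tfrac{1}{2} p_1$.

Now $p_1 = \tfrac{1}{2}$ and $p_2 = \tfrac{1}{2} + \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{3}{4}$, since a score of 2 can be attained
either by throwing a tail in the beginning or two heads successively
in the first two throws.

Hence $p_n + \tfrac{1}{2} p_{n-1} = 1$, which gives

$p_n - \tfrac{2}{3} = -\tfrac{1}{2}\,(p_{n-1} - \tfrac{2}{3})$,
$p_{n-1} - \tfrac{2}{3} = -\tfrac{1}{2}\,(p_{n-2} - \tfrac{2}{3})$,
$p_{n-2} - \tfrac{2}{3} = -\tfrac{1}{2}\,(p_{n-3} - \tfrac{2}{3})$,
...
$p_3 - \tfrac{2}{3} = -\tfrac{1}{2}\,(p_2 - \tfrac{2}{3})$,
$p_2 - \tfrac{2}{3} = -\tfrac{1}{2}\,(p_1 - \tfrac{2}{3})$. ... (1)

Multiplying equations (1), we have

$p_n - \tfrac{2}{3} = \left(-\tfrac{1}{2}\right)^{n-1} (p_1 - \tfrac{2}{3}) = \dfrac{(-1)^{n}}{3 \cdot 2^{n}}$, since $p_1 = \tfrac{1}{2}$,

or $p_n = \tfrac{1}{3}\left\{2 + (-1)^{n} \dfrac{1}{2^{n}}\right\}$.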
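The closed form can be compared with values generated directly from the recurrence, starting from $p_1 = \frac{1}{2}$ and $p_2 = \frac{3}{4}$. The function names below are illustrative only.

```python
from fractions import Fraction

def p_by_recurrence(n):
    """p_n from p_n = (p_{n-1} + p_{n-2}) / 2 with p_1 = 1/2, p_2 = 3/4."""
    values = [None, Fraction(1, 2), Fraction(3, 4)]  # index 0 is unused
    for k in range(3, n + 1):
        values.append((values[k - 1] + values[k - 2]) / 2)
    return values[n]

def p_closed_form(n):
    """p_n = (1/3) {2 + (-1)^n / 2^n}."""
    return (2 + Fraction((-1) ** n, 2 ** n)) / 3
```

Both give $p_3 = \frac{5}{8}$, and as n increases the values settle towards $\frac{2}{3}$.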

6.10. Variate. A variate, or random variable (or chance variable), is
a variable which takes a definite set of values with a definite
probability associated with each value.

6.11. Expectation of a variate. When x is a discrete variate
which may take n mutually exclusive values $x_i$ (i = 1, 2, ..., n)
and no others, with respective probabilities $p_i$ (i = 1, 2, ..., n), the
expectation of x is

$E(x) = p_1 x_1 + p_2 x_2 + \ldots + p_n x_n = \sum_{i=1}^{n} p_i x_i$.

When x is a continuous variate, the expectation of x is
given by

$E(x) = \int x\, \phi(x)\, dx$,

where $\phi(x)$ is the probability density function defining the distribution.

In general, $E\{\theta(x)\} = \int \theta(x)\, \phi(x)\, dx$.


6.12. Relative frequencies and Probabilities. If the variate
x takes the value $x_i$ with frequency $f_i$ and the total frequency is
$\sum f_i = N$, then $\dfrac{f_i}{N}$ is known as the relative frequency of the value
$x_i$. We have already seen that according to the statistical or
empirical definition of probability, the limit of the relative
frequency, if it is unique and finite, is termed the probability
of $x_i$. Now if the data given to us are regarded as a random
sample, which implies that the selection from the population
has been made in such a manner that each item has an equal
chance of being selected, we have no reason to suspect that
other samples drawn under the same conditions would
give different results, and hence generally the relative frequency
is taken as the measure of the probability of a value of the variate.
The theorem due to Bernoulli is helpful in understanding the
concept of probability.

Let $\epsilon$ and $\eta$ be two given positive numbers, however small,
and let m be the number of successes in n independent trials in
which the constant probability of success is p. Then the
probability that the inequality

$\left| \dfrac{m}{n} - p \right| < \epsilon$

will hold is greater than $1 - \eta$, provided n is greater than a
certain number N depending on $\epsilon$ and $\eta$. (Weatherburn)

Now for a discrete variate,

$E(x) = \sum_{i=1}^{n} p_i x_i$.

Here $p_i$ takes the place of the relative frequency $\dfrac{f_i}{N}$.

Hence $E(x) = \dfrac{\sum f_i x_i}{N} = \bar{x}$.

Thus E (x) represents the mean of the distribution.

Similarly $E(x^{2}) = \sum_{i} p_i x_i^{2} = \mu_2'$.

The variance $\sigma^{2} = E(x^{2}) - [E(x)]^{2} = E(x - \bar{x})^{2} = E\{x - E(x)\}^{2}$.

In general, $\mu_r' = E(x^{r})$.

The same results hold good for continuous variates.

Theorem of Expectation of a Sum and Product.

The mathematical expectation of a sum of a number of variables
is equal to the sum of their expectations. (B. Sc. Agra ’63)

Let us consider the random variables x and y. Let x assume
the values $x_i$ (i = 1, 2, 3, ..., m) and y the values $y_j$ (j = 1, 2, 3, ..., n)
with respective probabilities $p_i$ and $p_j'$. The sum x+y is a
random variable which can take the mn values

$x_i + y_j$ (i = 1, 2, ..., m; j = 1, 2, 3, ..., n)

with probabilities $P_{ij}$. Hence its expectation is

$E(x+y) = \sum_{j=1}^{n} \sum_{i=1}^{m} (x_i + y_j)\, P_{ij}$

$= \sum_{j=1}^{n} \sum_{i=1}^{m} x_i P_{ij} + \sum_{j=1}^{n} \sum_{i=1}^{m} y_j P_{ij}$

$= \sum_{i} x_i \Big( \sum_{j} P_{ij} \Big) + \sum_{j} y_j \Big( \sum_{i} P_{ij} \Big)$

$= \sum_{i} x_i p_i + \sum_{j} y_j p_j'$

$= E(x) + E(y)$,

since $\sum_{j} P_{ij}$ represents the probability $p_i$ of x assuming the
value $x_i$, so that $\sum_{j=1}^{n} P_{ij} = p_i$. Similarly $\sum_{i=1}^{m} P_{ij} = p_j'$.

By generalisation of the above theorem, we have

$E(x_1 + x_2 + \ldots + x_n) = E(x_1) + E(x_2) + \ldots + E(x_n)$.

Expectation of the Product of Random Variates.

The mathematical expectation of the product of a number of
independent random variates is equal to the product of their expectations.

Using the above notation,

$E(xy) = \sum_{j} \sum_{i} x_i y_j P_{ij}$.

Since the variates are independent, by the law of compound
probabilities $P_{ij} = p_i\, p_j'$, so that

$E(xy) = \sum_{j} \sum_{i} x_i p_i\, y_j p_j' = \sum_{i} x_i p_i \sum_{j} y_j p_j'$

$= \sum_{i} p_i x_i\, E(y) = E(y) \sum_{i} p_i x_i$

$= E(x) \cdot E(y)$.

The theorem can be generalized for a number of independent
random variates:

$E(x_1 x_2 \ldots x_n) = E(x_1) \cdot E(x_2) \ldots E(x_n)$.
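Both theorems can be illustrated with two fair dice, taking x and y to be the points shown. This is only an illustrative sketch, with variable names that are not part of the text; it also shows that the product rule can fail when the variates are not independent.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face of a fair die

e_x = sum(p * x for x in faces)  # E(x) for a single die

# x and y are the points on two independent dice, so P_ij = p * p.
e_sum = sum(p * p * (x + y) for x in faces for y in faces)
e_product = sum(p * p * x * y for x in faces for y in faces)

# Dependent case: y is the same throw as x, so y = x with certainty.
e_sum_dependent = sum(p * (x + x) for x in faces)
e_product_dependent = sum(p * x * x for x in faces)
```

Here `e_sum` and `e_sum_dependent` both equal 7 = E(x) + E(y), while `e_product` equals 49/4 = E(x) E(y) but `e_product_dependent` is 91/6.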

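6.14. Solved Examples.

1. A and B throw with one die for a prize of Rs. 11 which is
to be won by the player who first throws 6. If A has the first throw,
what are their respective expectations?

(M. Sc. Agra ’52, B. Sc. Hyderabad ’46)

The chances of throwing a six at the successive throws are as follows:

            A                                  B
    $\frac{1}{6}$                          $\frac{5}{6} \times \frac{1}{6}$
    $(\frac{5}{6})^{2} \cdot \frac{1}{6}$              $(\frac{5}{6})^{3} \cdot \frac{1}{6}$
    $(\frac{5}{6})^{4} \cdot \frac{1}{6}$              $(\frac{5}{6})^{5} \cdot \frac{1}{6}$
    ...                                ...

A's chances of success $= \frac{1}{6} + (\frac{5}{6})^{2} \cdot \frac{1}{6} + (\frac{5}{6})^{4} \cdot \frac{1}{6} + \ldots$

$= \dfrac{\frac{1}{6}}{1 - \frac{25}{36}} = \dfrac{6}{11}$.

B's chances of success $= \dfrac{5}{11}$.

A's expectation $= \frac{6}{11} \times 11$ = Rs. 6.

B's expectation $= \frac{5}{11} \times 11$ = Rs. 5.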

2. From a bag containing 2 sovereigns and 3 shillings, a
person is allowed to draw 2 coins indiscriminately. Find the value
of his expectation. (Agra M. Sc. ’56)

Probability of drawing 2 sovereigns $= \dfrac{^{2}C_{2}}{^{5}C_{2}} = \dfrac{1}{10}$.

Probability of drawing 1 sovereign and 1 shilling
$= \dfrac{^{2}C_{1} \times {}^{3}C_{1}}{^{5}C_{2}} = \dfrac{3}{5}$.

Probability of drawing two shillings $= \dfrac{^{3}C_{2}}{^{5}C_{2}} = \dfrac{3}{10}$.

The values of the variate in the three cases are 40 s., 21 s.
and 2 s. respectively.

Expectation $= (40 \times \frac{1}{10}) + (\frac{3}{5} \times 21) + (\frac{3}{10} \times 2) = 17\frac{1}{5}$ s.

3. Two players of equal skill, A and B, play a set of games;
they leave off playing when A wants 3 points and B wants 2. If the
stake is £ 16, what share ought each to take? (M. Sc. Agra ’48)

A wins if he can score three points before B scores two. If
we denote the scoring of a point by A as w and scoring by B as l,
the favourable cases to A are

w, w, w; w, l, w, w; w, w, l, w; l, w, w, w,

where w denotes the winning and l the losing of a game by A.

The probabilities are $(\frac{1}{2})^{3}$, $(\frac{1}{2})^{4}$, $(\frac{1}{2})^{4}$ and $(\frac{1}{2})^{4}$ respectively, as the
probability of either a win or a loss by A is $\frac{1}{2}$.

The chance of A's success $= \frac{1}{8} + \frac{3}{16} = \frac{5}{16}$.
The chance of B's success $= \frac{11}{16}$.

A should take £ 5 and B shall have £ 11.

4. A bag contains a coin of value M and a number of other
coins whose aggregate value is m. A person draws one at a time
till he gets the coin M. Find the value of his expectation.

(M. Sc. Agra ’53)

Let the other n coins be of values $m_1, m_2, \ldots, m_n$, so that

$m_1 + m_2 + \ldots + m_n = m$.

Since there are n+1 coins, the probability of drawing M in
the first attempt is $\dfrac{1}{n+1}$, which is also equal to the probability of
drawing any other coin, say $m_i$, other than M. Hence the value of the
expectation in the first draw is

$E_1 = \dfrac{M}{n+1} + \dfrac{m_1}{n+1} + \dfrac{m_2}{n+1} + \ldots + \dfrac{m_n}{n+1}$

$= \dfrac{1}{n+1} \left\{ M + \dfrac{n}{n}\, m \right\}$.

He gets the second chance if he fails to draw M in the first
attempt, the probability of this event being $\dfrac{n}{n+1}$. The probability
of having to make the second draw and drawing M in it is
$\dfrac{n}{n+1} \cdot \dfrac{1}{n}$. Also if a coin $m_i$ is drawn in the first attempt, the
probability of drawing another coin $m_j$, where $j \ne i$ and $m_j \ne M$, is
$\dfrac{1}{n}$. The value of his expectation in the second draw is

$E_2 = \dfrac{n}{n+1} \cdot \dfrac{1}{n}\, M + \dfrac{1}{n+1} \cdot \dfrac{1}{n} \sum_{i=1}^{n} \sum_{j=1,\, j \ne i}^{n} m_j$

$= \dfrac{M}{n+1} + \dfrac{1}{n+1} \cdot \dfrac{1}{n} \sum_{i=1}^{n} (m - m_i)$

$= \dfrac{M}{n+1} + \dfrac{1}{(n+1)\, n}\, (n-1)\, m$

$= \dfrac{1}{n+1} \left\{ M + \dfrac{n-1}{n}\, m \right\}$.

Similarly the probability of having to make the third draw
and getting M in it is $\dfrac{n}{n+1} \cdot \dfrac{n-1}{n} \cdot \dfrac{1}{n-1}$. Now suppose, having
drawn $m_i$ in the first draw, $m_j$ is drawn in the second and let $m_k$
be drawn in the third, $i \ne j \ne k$ and $m_k \ne M$. The value of the
expectation in the third draw is

$E_3 = \dfrac{n}{n+1} \cdot \dfrac{n-1}{n} \cdot \dfrac{1}{n-1}\, M + \dfrac{1}{n+1} \cdot \dfrac{1}{n} \cdot \dfrac{1}{n-1} \sum_{i=1}^{n} \sum_{j=1,\, j \ne i}^{n} \sum_{k=1,\, k \ne i, j}^{n} m_k$

$= \dfrac{M}{n+1} + \dfrac{1}{n+1} \cdot \dfrac{1}{n} \cdot \dfrac{1}{n-1} \sum_{i=1}^{n} \sum_{j=1,\, j \ne i}^{n} \{m - m_i - m_j\}$

$= \dfrac{M}{n+1} + \dfrac{1}{(n+1)\, n\, (n-1)} \sum_{i=1}^{n} \{(n-1)\, m - (m - m_i) - (n-1)\, m_i\}$

$= \dfrac{M}{n+1} + \dfrac{n-2}{(n+1)\, n\, (n-1)} \sum_{i=1}^{n} (m - m_i)$

$= \dfrac{M}{n+1} + \dfrac{n-2}{(n+1)\, n\, (n-1)}\, (n-1)\, m$

$= \dfrac{1}{n+1} \left\{ M + \dfrac{n-2}{n}\, m \right\}$,

and so on, so that the value of the expectation in the last possible
draw, i.e. the (n+1)st, is

$E_{n+1} = \dfrac{1}{n+1} \{M\}$.

The total expectation is

$E_1 + E_2 + \ldots + E_{n+1} = M + \dfrac{m}{n\,(n+1)}\, (1 + 2 + 3 + \ldots + n)$

$= M + \dfrac{m}{2}$.


Aliter. Suppose all the coins other than M are of equal
denomination, i.e. $\dfrac{m}{n}$ each.

The probability of drawing M in the first attempt is
$\dfrac{1}{n+1}$ and that of a coin other than M is $\dfrac{n}{n+1}$.

Expectation in the first attempt $E_1 = \dfrac{M}{n+1} + \dfrac{n}{n+1} \cdot \dfrac{m}{n} = \dfrac{M}{n+1} + \dfrac{m}{n+1}$.

He makes the second attempt only if he fails to draw M in
the first attempt, the probability of this event being $\dfrac{n}{n+1}$. Now
after the first unsuccessful draw, only n coins are left, of which one
is M and n-1 are of the denomination $\dfrac{m}{n}$ each. The probability of
drawing M in the second attempt is $\dfrac{1}{n}$ and that of a coin other than M is
$\dfrac{n-1}{n}$.

Hence $E_2 = \dfrac{n}{n+1} \left\{ \dfrac{M}{n} + \dfrac{n-1}{n} \cdot \dfrac{m}{n} \right\}$

$= \dfrac{M}{n+1} + \dfrac{m}{n+1} \left( 1 - \dfrac{1}{n} \right)$.

Similarly the third attempt is made if he fails to draw M in
the first two attempts, the probability of which is $\dfrac{n}{n+1} \cdot \dfrac{n-1}{n}$. Also
the probability of drawing M in the third attempt is $\dfrac{1}{n-1}$ and that of a
coin other than M is $\dfrac{n-2}{n-1}$.

Hence $E_3 = \dfrac{n}{n+1} \cdot \dfrac{n-1}{n} \left\{ \dfrac{M}{n-1} + \dfrac{n-2}{n-1} \cdot \dfrac{m}{n} \right\}$

$= \dfrac{M}{n+1} + \dfrac{m}{n+1} \left( 1 - \dfrac{2}{n} \right)$.

...

Similarly $E_r = \dfrac{M}{n+1} + \dfrac{m}{n+1} \left( 1 - \dfrac{r-1}{n} \right)$

and $E_{n+1} = \dfrac{M}{n+1}$.

In all, n+1 attempts are possible since initially there are only
n+1 coins.

His expectation $= E_1 + E_2 + E_3 + \ldots + E_n + E_{n+1}$

$= M + \dfrac{m}{n+1} \sum_{r=1}^{n} \left( 1 - \dfrac{r-1}{n} \right)$

$= M + \dfrac{m}{n+1} \left\{ n - \dfrac{(n-1)\, n}{2n} \right\}$

$= M + \dfrac{m}{2}$.
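The result M + m/2 can be checked for particular coin values by averaging the amount drawn over every equally likely order in which the coins could come out of the bag. The following is a sketch with hypothetical names and example values.

```python
from fractions import Fraction
from itertools import permutations

def expected_amount_drawn(big, others):
    """Average total value drawn when coins are taken one at a time until
    the coin of value `big` (the coin M) appears; `others` lists the values
    of the remaining coins."""
    coins = [("M", big)] + [("m%d" % i, v) for i, v in enumerate(others)]
    orders = list(permutations(coins))  # all orders are equally likely
    total = Fraction(0)
    for order in orders:
        for label, value in order:
            total += value
            if label == "M":
                break  # he stops as soon as M is drawn
    return total / len(orders)
```

With M = 20 and other coins of values 1, 2, 5 and 10 (so that m = 18), the function returns 29 = 20 + 18/2.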


5. A makes a bet with B of 5 s. to 2 s. that in a single throw with two dice, he will throw 7 before B throws 4. Each has a pair of dice and they throw simultaneously until one of them wins, equal throws being disregarded. Find B's expectation. (Agra M. Sc. ’60)

Seven can be thrown in six ways,

(6, 1), (5, 2), (4, 3), (3, 4), (2, 5), (1, 6).

Probability of throwing 7 $= \dfrac{6}{36} = \dfrac{1}{6}$.

Similarly four can be thrown in three ways,

(3, 1), (2, 2), (1, 3).

∴ Probability of throwing 4 $= \dfrac{3}{36} = \dfrac{1}{12}$.

Probability of throwing neither 7 nor 4 $= 1 - \dfrac{1}{6} - \dfrac{1}{12} = \dfrac{3}{4}$.

Now A wins if he throws 7 but B throws neither 7 nor 4.

A's chance of winning in the first throw $= \dfrac{1}{6}\times\dfrac{3}{4} = \dfrac{1}{8}$.

B's chance of winning in the first throw $= \dfrac{1}{12}\times\dfrac{3}{4} = \dfrac{1}{16}$.

Probability of none winning in the first throw $= 1 - \dfrac{1}{8} - \dfrac{1}{16} = \dfrac{13}{16}$.

Probability of A winning in the second throw $= \dfrac{13}{16}\cdot\dfrac{1}{8}$.

Probability of B winning in the second throw $= \dfrac{13}{16}\cdot\dfrac{1}{16}$.

Similarly the probability of A's win in the third throw $= \left(\dfrac{13}{16}\right)^2\cdot\dfrac{1}{8}$, and the probability of B's win in the third throw $= \left(\dfrac{13}{16}\right)^2\cdot\dfrac{1}{16}$.

Thus we get an infinite geometric series.

A's chances $= \dfrac{1}{8} + \dfrac{13}{16}\cdot\dfrac{1}{8} + \left(\dfrac{13}{16}\right)^2\cdot\dfrac{1}{8} + \ldots = \dfrac{1/8}{1-\dfrac{13}{16}} = \dfrac{2}{3}$.

B's chances $= 1 - \dfrac{2}{3} = \dfrac{1}{3}$.

A gets 2 s. from B if he wins and pays 5 s. to B if he loses.

B's expectation $= \left(\dfrac{1}{3}\times 5 - \dfrac{2}{3}\times 2\right)$ s. $= \dfrac{1}{3}$ s. = 4 d.

Aliter. Clearly A's chance in each trial is double of B's. Let B's chance of ultimately winning the game be x. Then A's chance = 2x.

∴ $x + 2x = 1$, i.e. $x = \dfrac{1}{3}$, etc.
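
The arithmetic of this problem can be checked with exact fractions. The sketch below is only an illustration; the helper name `game_chances` is hypothetical, and the per-round chances follow the convention used in the solution above (a player wins a round when he makes his own point and the opponent throws neither 7 nor 4).

```python
from fractions import Fraction

def game_chances(p_a_round, p_b_round):
    """Overall winning chances when rounds repeat until one player wins.

    Summing the geometric series p + r*p + r^2*p + ... with
    r = 1 - p_a_round - p_b_round gives p / (p_a_round + p_b_round).
    """
    decided = p_a_round + p_b_round
    return p_a_round / decided, p_b_round / decided

p_seven = Fraction(6, 36)             # six ways to throw 7
p_four = Fraction(3, 36)              # three ways to throw 4
p_neither = 1 - p_seven - p_four      # 3/4

a_round = p_seven * p_neither         # 1/8
b_round = p_four * p_neither          # 1/16
a_total, b_total = game_chances(a_round, b_round)

# B receives 5 s. when he wins and pays 2 s. when he loses.
b_expectation_shillings = 5 * b_total - 2 * a_total
b_expectation_pence = 12 * b_expectation_shillings

print(a_total, b_total, b_expectation_pence)
```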



6. What is the expectation of the number of failures preceding the first success in an infinite series of independent trials with constant probability p of success ?

(B. A. Hons. Delhi ’61, M. A. Madras ’59)

The probabilities of the first success occurring in the 1st, 2nd, 3rd, ... trials respectively are

$p,\ qp,\ q^2p,\ \ldots$ where $q = 1-p$.

The expected number of failures preceding the first success

$E(x) = 0\cdot p + 1\cdot qp + 2\cdot q^2p + \ldots$

$= qp\,\{1 + 2q + 3q^2 + \ldots \text{ to } \infty\},\quad q < 1$

$= \dfrac{qp}{(1-q)^2} = \dfrac{q}{p} = \dfrac{1-p}{p}$.
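
A brief numerical check of this result, offered as a sketch only: the function name `expected_failures` and the truncation at a fixed number of terms are assumptions made for illustration.

```python
def expected_failures(p, terms=10_000):
    """Partial sum of 0*p + 1*q*p + 2*q^2*p + ... with q = 1 - p."""
    q = 1.0 - p
    return sum(k * q**k * p for k in range(terms))

p = 0.25
print(expected_failures(p), (1 - p) / p)  # the two values should agree closely
```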

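7. A, B, C, D cut a pack of cards successively in the order mentioned. If the person who cuts a spade first receives £175, what are their expectations ? (M. Sc. Agra ’44)

A's chance of cutting a spade in the first attempt $= \dfrac{1}{4}$.

B gets the chance if A fails and hence the probability of his getting the chance and succeeding $= \dfrac{3}{4}\times\dfrac{1}{4}$.

Similarly C's probability of getting the chance and succeeding $= \left(\dfrac{3}{4}\right)^2\times\dfrac{1}{4}$.

Similarly D's probability of getting the chance and succeeding $= \left(\dfrac{3}{4}\right)^3\times\dfrac{1}{4}$.

A gets the chance again if all of them fail in the first attempt and hence his probability of success then $= \left(\dfrac{3}{4}\right)^4\times\dfrac{1}{4}$.

A can succeed in the 1st, 5th, 9th, ... draws and the probability of his winning

$= \dfrac{1}{4} + \left(\dfrac{3}{4}\right)^4\dfrac{1}{4} + \left(\dfrac{3}{4}\right)^8\dfrac{1}{4} + \ldots = \dfrac{1/4}{1-\left(\dfrac{3}{4}\right)^4} = \dfrac{64}{175}$.

Similarly B's chances of success are given by

$\dfrac{3}{4}\cdot\dfrac{1}{4}\left\{1 + \left(\dfrac{3}{4}\right)^4 + \left(\dfrac{3}{4}\right)^8 + \ldots\right\} = \dfrac{64}{175}\times\dfrac{3}{4} = \dfrac{48}{175}$.

C's chances of success

$= \left(\dfrac{3}{4}\right)^2\dfrac{1}{4}\left\{1 + \left(\dfrac{3}{4}\right)^4 + \left(\dfrac{3}{4}\right)^8 + \ldots\right\} = \dfrac{48}{175}\times\dfrac{3}{4} = \dfrac{36}{175}$.

D's chances of success

$= \left(\dfrac{3}{4}\right)^3\dfrac{1}{4}\left\{1 + \left(\dfrac{3}{4}\right)^4 + \left(\dfrac{3}{4}\right)^8 + \ldots\right\} = \dfrac{36}{175}\times\dfrac{3}{4} = \dfrac{27}{175}$.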


(M. Sc. Agra ’60) 


Their respective expectations are

A's $= \dfrac{64}{175}\times 175 = £64$,

B's $= \dfrac{48}{175}\times 175 = £48$,

C's $= \dfrac{36}{175}\times 175 = £36$,

D's $= \dfrac{27}{175}\times 175 = £27$.

Note. It may be noted that B's chances of success and expectation are $\tfrac{3}{4}$ times those of A, C's $\tfrac{3}{4}$ times those of B and so on.
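
The four chances can be reproduced with exact fractions. This is a minimal sketch under the assumptions of the solution (each cut is a spade with probability 1/4, independently); the helper `turn_taking_chances` is a hypothetical name.

```python
from fractions import Fraction

def turn_taking_chances(p, players):
    """Winning chances when players take turns and the first success wins.

    Player i (0-based) first gets a turn after i failures; after a full
    round of failures the situation repeats, giving a geometric series
    with common ratio q**players.
    """
    q = 1 - p
    return [q**i * p / (1 - q**players) for i in range(players)]

chances = turn_taking_chances(Fraction(1, 4), 4)
expectations = [175 * c for c in chances]   # in pounds
print(chances, expectations)
```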

8. A person draws 2 balls from a bag containing 3 white and 5 black balls. If he is to receive 10 s. for every white ball which he draws and 1 s. for every black ball, what is his expectation ?

The number of ways in which 2 balls can be drawn $= {}^{8}C_2 = 28$.

The number of ways for 2 black balls $= {}^{5}C_2 = 10$.

The number of ways for 1 black and 1 white ball $= {}^{5}C_1 \times {}^{3}C_1 = 15$.

The number of ways for 2 white balls $= {}^{3}C_2 = 3$.

The respective probabilities for the above events are $\dfrac{10}{28}$, $\dfrac{15}{28}$ and $\dfrac{3}{28}$, and the amounts he receives for them are 2 s., 11 s. and 20 s. respectively.

Hence the expectation $= \left(2\times\dfrac{10}{28} + 11\times\dfrac{15}{28} + 20\times\dfrac{3}{28}\right)$ s. $= \dfrac{245}{28}$ s.

= 8 s. 9 d.


9. A and B play for a prize of Rs. 324. A is to throw a die first and is to win if he throws 6. If he fails, B is to throw and is to win if he throws 6 or 5. If he fails, A is to throw again and to win if he throws 6 or 5 or 4 and so on. Find the respective expectations.

Probability of A's throwing 6 and winning $= \dfrac{1}{6}$.

Probability of A's not throwing 6 and B's throwing 6 or 5

$= \dfrac{5}{6}\times\dfrac{2}{6} = \dfrac{5}{18}$.

Probability of A's winning in the third throw

$= \dfrac{5}{6}\times\dfrac{4}{6}\times\dfrac{3}{6} = \dfrac{5}{18}$.

Probability of B's winning in the fourth throw

$= \dfrac{5}{6}\times\dfrac{4}{6}\times\dfrac{3}{6}\times\dfrac{4}{6} = \dfrac{5}{27}$.

Probability of A's winning in the fifth throw

$= \dfrac{5}{6}\times\dfrac{4}{6}\times\dfrac{3}{6}\times\dfrac{2}{6}\times\dfrac{5}{6} = \dfrac{25}{324}$.

Probability of B's winning in the sixth throw

$= \dfrac{5}{6}\times\dfrac{4}{6}\times\dfrac{3}{6}\times\dfrac{2}{6}\times\dfrac{1}{6}\times\dfrac{6}{6} = \dfrac{5}{324}$.

(Note that there cannot be more than six throws in this game.)

Probability of A's winning $= \dfrac{1}{6} + \dfrac{5}{18} + \dfrac{25}{324} = \dfrac{169}{324}$.

Probability of B's winning $= \dfrac{5}{18} + \dfrac{5}{27} + \dfrac{5}{324} = \dfrac{155}{324}$.

A's expectation $= \dfrac{169}{324}\times 324 = $ Rs. 169.

B's expectation $= \dfrac{155}{324}\times 324 = $ Rs. 155.
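
The six throws can also be tabulated mechanically. The following is a sketch only; the function name `escalating_die_game` is hypothetical and it assumes, as in the solution, that the thrower on the k-th throw wins with probability k/6.

```python
from fractions import Fraction

def escalating_die_game(sides=6):
    """Return [A's chance, B's chance] when throw k wins with probability k/sides."""
    still_playing = Fraction(1)            # probability the game has reached this throw
    win = [Fraction(0), Fraction(0)]       # index 0 is A (odd throws), index 1 is B
    for throw in range(1, sides + 1):
        p_win = Fraction(throw, sides)
        win[(throw - 1) % 2] += still_playing * p_win
        still_playing *= 1 - p_win
    return win

a_chance, b_chance = escalating_die_game()
print(a_chance * 324, b_chance * 324)      # expectations in rupees
```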


10. A person draws cards one by one from a pack until he draws all the aces. How many cards may he be expected to draw ?

(I. S. I. Calcutta ’61)

Suppose he has to make n draws for all the aces. It means that in n-1 draws, he draws three aces and in the nth, one ace. The probability of such an occurrence

$= \dfrac{{}^{4}C_3 \times {}^{48}C_{n-4}}{{}^{52}C_{n-1}} \times \dfrac{1}{52-(n-1)}$

$= \dfrac{4\times 48!\times (n-1)!\,(52-n+1)!}{(n-4)!\,(52-n)!\times 52!} \times \dfrac{1}{52-n+1}$

$= \dfrac{4\,(n-1)(n-2)(n-3)}{49\times 50\times 51\times 52}$.

The least number of draws he has to make is 4 and the maximum number 52. Hence n ranges from 4 to 52.

The expected number of draws

$= \sum_{n=4}^{52} n\cdot\dfrac{4\,(n-1)(n-2)(n-3)}{49\times 50\times 51\times 52}$

$= \dfrac{4}{49\times 50\times 51\times 52}\left\{\sum_{n=1}^{52} n^4 - 6\sum_{n=1}^{52} n^3 + 11\sum_{n=1}^{52} n^2 - 6\sum_{n=1}^{52} n\right\}$

$= \dfrac{212}{5} = 42.4$,

the terms for n = 1, 2, 3 being zero.
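
The sum can be evaluated exactly by machine. This is an illustrative sketch; `prob_last_ace_at` is a hypothetical helper implementing the probability derived above.

```python
from fractions import Fraction
from math import comb

def prob_last_ace_at(n):
    """Probability that the fourth ace appears exactly at the n-th draw."""
    three_aces_before = Fraction(comb(4, 3) * comb(48, n - 4), comb(52, n - 1))
    last_ace_now = Fraction(1, 52 - (n - 1))
    return three_aces_before * last_ace_now

total_probability = sum(prob_last_ace_at(n) for n in range(4, 53))
expected_draws = sum(n * prob_last_ace_at(n) for n in range(4, 53))
print(total_probability, expected_draws)
```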

11. A and B play a game of tossing a coin and he who first throws an H wins the game and the game terminates. If A begins the game and each player wins an amount of money equal to the number of tosses required for the win, find their respective mathematical expectations.

A can win in the first, third, fifth, ... throws, the probabilities being

$\dfrac{1}{2},\ \dfrac{1}{2^3},\ \dfrac{1}{2^5},\ \ldots$

Hence A's expectation $= 1\cdot\dfrac{1}{2} + 3\cdot\dfrac{1}{2^3} + 5\cdot\dfrac{1}{2^5} + \ldots$

This is an infinite arithmetico-geometric progression whose sum is $\dfrac{10}{9}$.

Similarly B's expectation $= \dfrac{2}{2^2} + \dfrac{4}{2^4} + \dfrac{6}{2^6} + \ldots = \dfrac{8}{9}$.




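12. A player throwing an ordinary die is to receive Rs. $\dfrac{1}{2^n}$ when n is the number of throws that he takes to throw the first 6. Find his expectation.

The probabilities of throwing six for the first time in the first, second, third, ... attempts respectively are

$\dfrac{1}{6},\ \dfrac{5}{6}\cdot\dfrac{1}{6},\ \left(\dfrac{5}{6}\right)^2\dfrac{1}{6},\ \ldots$

Hence his expectation in rupees

$= \dfrac{1}{6}\cdot\dfrac{1}{2} + \dfrac{5}{6}\cdot\dfrac{1}{6}\cdot\dfrac{1}{2^2} + \left(\dfrac{5}{6}\right)^2\dfrac{1}{6}\cdot\dfrac{1}{2^3} + \ldots + \left(\dfrac{5}{6}\right)^n\dfrac{1}{6}\cdot\dfrac{1}{2^{n+1}} + \ldots$

$= \dfrac{1}{6}\cdot\dfrac{1}{2}\left\{1 + \dfrac{5}{12} + \left(\dfrac{5}{12}\right)^2 + \ldots\right\}$

$= \dfrac{1}{12}\cdot\dfrac{1}{1-\dfrac{5}{12}} = \dfrac{1}{7}$.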




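13. A man throws a six-faced die until he gets an ace. He is to receive £1 if he succeeds at the first throw, 10 s. if he succeeds at the second throw, 6 s. 8 d. if he succeeds at the third throw and so on.

Given log 2 = .3010300,

log 3 = .4771213,

log 2.718282 = .4342945,

find the value of his expectation to the nearest penny.

He gets £1 if he succeeds in the first throw, £$\tfrac{1}{2}$ if he succeeds in the second throw, £$\tfrac{1}{3}$ if he succeeds in the third throw and so on.

His expectation (in pounds) is

$\dfrac{1}{6}\cdot 1 + \dfrac{5}{6}\cdot\dfrac{1}{6}\cdot\dfrac{1}{2} + \left(\dfrac{5}{6}\right)^2\dfrac{1}{6}\cdot\dfrac{1}{3} + \ldots + \left(\dfrac{5}{6}\right)^n\dfrac{1}{6}\cdot\dfrac{1}{n+1} + \ldots$

$= \dfrac{1}{5}\left\{\dfrac{5}{6} + \dfrac{1}{2}\left(\dfrac{5}{6}\right)^2 + \dfrac{1}{3}\left(\dfrac{5}{6}\right)^3 + \ldots\right\}$

$= -\dfrac{1}{5}\log_e\left(1-\dfrac{5}{6}\right) = \dfrac{1}{5}\log_e 6$

$= \dfrac{1}{5}\cdot\dfrac{\log_{10} 6}{\log_{10} e} = \dfrac{1}{5}\cdot\dfrac{\log_{10} 2 + \log_{10} 3}{\log_{10} e}$

$= \dfrac{1}{5}\cdot\dfrac{.3010300 + .4771213}{.4342945}$

= 7 s. 2 d. nearly.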
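
The logarithmic series can be checked numerically. The sketch below is illustrative only: the function name `expectation_pounds` and the truncation at 2000 terms are assumptions, and the conversion uses 240 pence to the pound and 12 pence to the shilling.

```python
import math

def expectation_pounds(terms=2000):
    """Partial sum of (5/6)^(n-1) * (1/6) * (1/n), the expectation in pounds."""
    return sum((5 / 6) ** (n - 1) * (1 / 6) / n for n in range(1, terms + 1))

series_value = expectation_pounds()
closed_form = math.log(6) / 5              # (1/5) log_e 6
pence = round(closed_form * 240)           # nearest penny
shillings, odd_pence = divmod(pence, 12)
print(series_value, closed_form, shillings, odd_pence)
```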

Exercises 

1. Obtain the probability that the birth-days of seven people will fall on seven different days of the week, assuming equal probability for the seven days. [6!/7^6]

2. From a pack of 52 cards, two are drawn at random. Find the chance that one is a king and the other a queen.
(Agra M. Sc. 57) [8/663]

3. Six cards are drawn at random from a pack of 52 cards. What is the probability that three will be red and three black ?
(Agra B. Sc. 56) [(26C3)^2 / 52C6]

4. There are 30 students in a class. Evaluate the probability that at least two have the same birth-day. (Assume that the year contains 365 days, all of them equally likely as birth-days.) [0.706]

5. The odds against a certain event are 5 to 2 and the odds in favour of another (independent) event are 6 to 5; find the chance that at least one of the events will happen.
(Agra M. Sc. 45) [52/77]

6. Find the chance of throwing a sum of 9 in a single throw of two dice. [1/9]

7. If the letters of the word ‘REGULATIONS’ be arranged at random, what is the chance that there will be exactly four letters between the ‘R’ and the ‘E’ ? [6/55]

8. It is 8 : 5 against a person who is now 40 years old living till he is 70 and 4 : 3 against a person now 50 living till he is 80. Find the probability that at least one of these persons will be alive 30 years hence. (Agra B. Sc. 61) [59/91]

9. A bag contains 50 tickets numbered 1, 2, ..., 50 of which 5 are drawn at random and arranged in ascending order of their numbers, x_1 < x_2 < x_3 < x_4 < x_5. What is the probability that x_3 = 30 ? (Punjab B. A. 54) [29C2 × 20C2 / 50C5]

10. What is the chance of throwing a total of 3 or 5 or 11 with two dice ? [2/9]

11. It is known that only 2 out of 10 of a particular type of operations performed by a doctor are successful. What is the probability that he will succeed in at least 3 out of 5 operations that he performs ? [181/3125]

12. A number is chosen from each of two sets :

1, 2, 3, 4, 5, 6, 7, 8, 9; 1, 2, 3, 4, 5, 6, 7, 8, 9.

If p_1 denotes the probability that the sum of the two numbers be 10 and p_2 the probability that their sum be 8, find p_1 + p_2.
(Agra B. Sc. 55) [16/81]

13. A throws three coins and B throws two coins. Find the chance that A will throw a greater number of heads than B.
(Agra B. Sc. 55) [1/2]

14. In a bag are six balls, of which 3 are white and 3 are black. They are drawn successively without replacement. What is the chance that the colours alternate ? (Agra M. Sc. 44) [1/10]

15. The face cards are discarded from two decks of cards; then one card is drawn from each deck. What is the expectation of the sum of the numbers drawn ? [11]

16. A speaks the truth in 75% of the cases and B in 80% of the cases. In what percentage of cases are they likely to contradict each other in stating the same fact ? (Punjab B. A. 58) [35]

17. A point P is taken at random on a rectilinear segment AB whose middle point is O. What is the probability that AP, BP and AO can form a triangle ? [1/2]

18. Three boxes identical in appearance each have two drawers. The first box contains a gold coin in each drawer, the second contains a silver coin in each drawer; but the third a gold coin in one drawer and a silver coin in the other. (a) A box is chosen at random. What is the probability that it contains coins of different metals ? (b) A box is chosen, one of its drawers opened and a gold coin found. What is the probability that the other drawer contains a silver coin ?
[(a) 1/3, (b) 1/3]

19. Examine the statement: The probability of Rama passing the 
examination is h and the probability of Seeta passing it is 
Therefore it is certain that at least one of them will pass the 

examination. 

20. Show that it is more advantageous to bet on one ace with four throws of one die than to bet on a double ace with twenty-four throws of two dice.


CHAPTER VII 


CONTINUOUS FREQUENCY DISTRIBUTION. 

7.1. Discrete and Continuous Variables. Suppose we collect data for the sizes of families in a certain town. It is obvious that the number of members in each family would be a whole number (if we consider each member as one unit irrespective of age or sex). Thus there would be no family with 2.5 or 2.67 or 3.94 members. The variable (number of members in a family) in this case is a discrete variable. The linear scale employed with discrete variables is always characterized by gaps at which no real measures may ever be found. On the other hand, if we measure the heights of a large number of plants and if our unit of measurement is very fine, there would be no point along the scale of measurement (between the extreme values of the heights) at which we may not find the height of a plant, no matter how finely we divide the scale. A variable (the height in this example) which takes all possible values between its limits, say a and b, is known as a continuous variable. The school enrolment figures, number of passengers detraining at a certain station each day, mortality figures etc. are examples of discrete variables, while the weights of school-children, barometric pressures, temperatures etc. are those of continuous variables.

7.2. Continuous Distributions. In the example given above, viz. the heights of the plants, if we go on measuring indefinitely, none of our class intervals, however small, will be vacant. Thus in the case of a continuous variate, we cannot speak of the probability of the variate, say x, taking a particular value $x_1$, as we can do in the case of a discrete variate. In a finite range of the variate, however small, x can take all possible values, infinite in number, and therefore the probability or the relative frequency that $x = x_1$ appears to be zero although it is not impossible that $x = x_1$. All we can do is to assign a probability to an interval of x. Thus the probability that a value of x, say X, lies within the infinitesimal interval $x - \tfrac{1}{2}dx$ and $x + \tfrac{1}{2}dx$ is

$P\left(x - \tfrac{1}{2}dx \le X \le x + \tfrac{1}{2}dx\right) = f(x)\,dx$

where f(x) is a continuous function of x called the probability density function and is measured by the ordinate of the probability curve

$y = f(x)$

and f(x) dx is called the probability differential.

Clearly the probability that a value of x lies within the interval a, b is given by

$P(a \le x \le b) = \int_a^b f(x)\,dx$

which is represented by the area under the probability curve y = f(x) between the values x = a and x = b, under the conditions

(i) $f(x) \ge 0$ for all values of x.

(ii) $\int_\alpha^\beta f(x)\,dx = 1$ if (α, β) is the range of x.

Generally it is more convenient to define the range of the variable from -∞ to ∞. If the range is from α to β, we define f(x) as

$f(x) = 0$, $-\infty < x < \alpha$,

$f(x) = \phi(x)$, $\alpha \le x \le \beta$,

$f(x) = 0$, $\beta < x < \infty$

and $\int_{-\infty}^{\infty} f(x)\,dx = 1$,

which implies that the total area under the curve y = f(x) is unity.

7.3. Cumulative Distributions. Since in the case of continuous variates, the probabilities are given by integrals, it is more convenient to deal with the integrals of densities. Thus the probability that the value of the variate is less than x is

$F(x) = \int_{-\infty}^{x} f(x)\,dx$.

F(x) is called the cumulative function of x or the cumulative distribution. The cumulative distribution has the properties

F(x) is a non-decreasing function,

$F(-\infty) = 0$

and $F(\infty) = 1$.

The probability density is given by

$f(x) = \dfrac{dF(x)}{dx}$.

The probability that a value of x lies within the interval a, b is

$P(a \le x \le b) = F(b) - F(a)$.


7.4. Any Function. Any positive function of x can be changed into a probability density function provided it is multiplied by a constant which will make the area under the curve unity.

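7.5. Moments. If the range of the probability density function is from -∞ to ∞, the rth moment about the origin is

$\mu_r' = \int_{-\infty}^{\infty} x^r f(x)\,dx$.

The rth moment about any arbitrary origin a is

$\mu_r' = \int_{-\infty}^{\infty} (x-a)^r f(x)\,dx$.

The mean is given by (taking the moment about x = 0)

$\mu_1' = \int_{-\infty}^{\infty} x f(x)\,dx$.

The variance $\sigma^2$ is given by

$\sigma^2 = \mu_2' - \mu_1'^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \left\{\int_{-\infty}^{\infty} x f(x)\,dx\right\}^2$.

As proved earlier,

$\mu_n' = \mu_n + {}^{n}C_1\,\mu_{n-1}\,\mu_1' + {}^{n}C_2\,\mu_{n-2}\,(\mu_1')^2 + \ldots + (\mu_1')^n$,

$\mu_n = \mu_n' - {}^{n}C_1\,\mu_{n-1}'\,\mu_1' + {}^{n}C_2\,\mu_{n-2}'\,(\mu_1')^2 - \ldots + (-1)^n (\mu_1')^n$.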

7.6. The geometric mean G is given by the relation

$\log G = \int_{-\infty}^{\infty} \log x \cdot f(x)\,dx$,

the harmonic mean H by

$\dfrac{1}{H} = \int_{-\infty}^{\infty} \dfrac{1}{x}\,f(x)\,dx$

and the mean deviation about the mean by

$\int_{-\infty}^{\infty} |x - \bar{x}|\,f(x)\,dx$

where $\bar{x}$ is the mean.

The mode is the value of the variable for which

$\dfrac{df(x)}{dx} = 0$ and $\dfrac{d^2 f(x)}{dx^2} < 0$,

if this value lies within the range of x.

The median $M_d$ is the value of x, the ordinate at which divides the area under the curve into two equal parts :

$\int_{-\infty}^{M_d} f(x)\,dx = \int_{M_d}^{\infty} f(x)\,dx = \dfrac{1}{2}$.

The lower quartile $Q_1$ is given by

$\int_{-\infty}^{Q_1} f(x)\,dx = \dfrac{1}{4}$

and for $Q_3$, we have $\int_{Q_3}^{\infty} f(x)\,dx = \dfrac{1}{4}$.

Note. The continuous and discrete cases of relative frequency distributions, in which the total frequency is unity, are both included in what is known as the Stieltjes integral

$\int_\alpha^\beta dF(x) = 1$

which means $\int_\alpha^\beta f(x)\,dx = 1$ when x is a continuous variable and $\sum_x f(x) = 1$ when x is discrete.

7.7. Solved Examples.

1. In a continuous distribution whose relative frequency density is given by $f(x) = \tfrac{3}{4}x(2-x)$, the variable ranges from 0 to 2. Show that the distribution is symmetrical with mean $\bar{x} = 1$ and variance $\tfrac{1}{5}$. Show that the third moment about x = 0 is $\tfrac{8}{5}$. Verify $\mu_3 = 0$.

(Agra B. Sc. ’60, Delhi Hons. B. A. ’51, 58, 60, Punjab B. A. ’61)

$\mu_1'$ (about origin) $= \dfrac{3}{4}\int_0^2 x^2(2-x)\,dx = 1$,

$\mu_2'$ (about origin) $= \dfrac{3}{4}\int_0^2 x^3(2-x)\,dx = \dfrac{6}{5}$,

$\mu_3'$ (about origin) $= \dfrac{3}{4}\int_0^2 x^4(2-x)\,dx = \dfrac{8}{5}$.

$\mu_2 = \mu_2' - (\mu_1')^2 = \dfrac{6}{5} - 1 = \dfrac{1}{5}$,

$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3 = \dfrac{8}{5} - \dfrac{18}{5} + 2 = 0$.

Since $\mu_3$, which is a measure of skewness, is zero, the distribution is symmetrical about the mean.
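
The moments found in Example 1 can be verified by numerical integration. This is a minimal sketch assuming Simpson's rule with an even number of strips; the helper names `simpson`, `density` and `raw_moment` are hypothetical.

```python
def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] with n (even) strips."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def density(x):
    """f(x) = (3/4) x (2 - x) on the range 0 to 2."""
    return 0.75 * x * (2 - x)

def raw_moment(r):
    """r-th moment about the origin."""
    return simpson(lambda x: x**r * density(x), 0.0, 2.0)

m1, m2, m3 = raw_moment(1), raw_moment(2), raw_moment(3)
variance = m2 - m1**2
mu3 = m3 - 3 * m2 * m1 + 2 * m1**3
print(raw_moment(0), m1, variance, m3, mu3)
```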

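2. Show that for the distribution given by

$df = \dfrac{dx}{2a}$, $-a \le x \le a$,

$\mu_2 = \dfrac{a^2}{3}$, $\mu_3 = 0$, $\mu_4 = \dfrac{a^4}{5}$.

(Agra M. Sc. ’59, I. A. S. ’59)

We have $\mu_1' = \dfrac{1}{2a}\int_{-a}^{a} x\,dx = 0$, so that the moments about the mean are the same as the moments about the origin.

Hence

$\mu_2 = \dfrac{1}{2a}\int_{-a}^{a} x^2\,dx = \dfrac{a^2}{3}$,

$\mu_3 = \dfrac{1}{2a}\int_{-a}^{a} x^3\,dx = 0$,

$\mu_4 = \dfrac{1}{2a}\int_{-a}^{a} x^4\,dx = \dfrac{a^4}{5}$.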




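3. The frequency distribution of a measurable characteristic varying between 0 and 2 is represented by the following :

$f(x) = x^3$, $0 \le x \le 1$,

$f(x) = (2-x)^3$, $1 \le x \le 2$.

Calculate the standard deviation and also the mean deviation about the mean. (B. A. Andhra ’46)

The total frequency

$N = \int_0^1 x^3\,dx + \int_1^2 (2-x)^3\,dx = \dfrac{1}{4} + \dfrac{1}{4} = \dfrac{1}{2}$.

$\mu_1' = 2\left[\int_0^1 x\cdot x^3\,dx + \int_1^2 x(2-x)^3\,dx\right]$

$= \dfrac{2}{5} + 2\left[-\dfrac{1}{4}(2-x)^4 x\right]_1^2 + \dfrac{1}{2}\int_1^2 (2-x)^4\,dx$

$= \dfrac{2}{5} + \dfrac{1}{2} - \dfrac{1}{10}\left[(2-x)^5\right]_1^2$

$= 2\left(\dfrac{1}{5} + \dfrac{1}{4} + \dfrac{1}{20}\right) = 1$.

Now $\mu_2' = 2\left[\int_0^1 x^2\cdot x^3\,dx + \int_1^2 x^2(2-x)^3\,dx\right]$

$= \dfrac{1}{3} + 2\left[\left\{-\dfrac{1}{4}x^2(2-x)^4\right\}_1^2 + \dfrac{1}{2}\int_1^2 x(2-x)^4\,dx\right]$

$= \dfrac{1}{3} + \dfrac{1}{2} + \dfrac{7}{30} = \dfrac{16}{15}$.

Hence

$\mu_2 = \mu_2' - (\mu_1')^2 = \dfrac{16}{15} - 1 = \dfrac{1}{15}$,

σ (standard deviation) $= \dfrac{1}{\sqrt{15}}$.

The mean deviation about the mean

$= \dfrac{1}{N}\int_0^1 |x-1|\,x^3\,dx + \dfrac{1}{N}\int_1^2 |x-1|\,(2-x)^3\,dx$

$= 2\left[\int_0^1 (1-x)\,x^3\,dx + \int_1^2 (x-1)(2-x)^3\,dx\right]$

$= 2\left[\int_0^1 (x^3 - x^4)\,dx - \int_1^2 (2-x-1)(2-x)^3\,dx\right]$

$= 2\left[\int_0^1 (x^3 - x^4)\,dx - \int_1^2 (2-x)^4\,dx + \int_1^2 (2-x)^3\,dx\right]$

$= \dfrac{1}{5}$ on simplification.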


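4. Show that for the exponential distribution,

$dP = y_0\,e^{-x/\sigma}\,dx$, $0 \le x < \infty$,

the mean and S. D. are both equal to σ and that the interquartile range is $\sigma \log_e 3$. Also find $\mu_r'$ and show that $\beta_1 = 4$, $\beta_2 = 9$.

(I. A. S. ’56, B. A. Hons. Delhi 1953)

In order to change the given distribution into a probability density function, the total area under the curve $y = y_0 e^{-x/\sigma}$ should be unity.

$y_0\int_0^\infty e^{-x/\sigma}\,dx = 1$, giving $y_0 = \dfrac{1}{\sigma}$.

$\mu_r' = \dfrac{1}{\sigma}\int_0^\infty x^r e^{-x/\sigma}\,dx = \sigma^r\,\Gamma(r+1) = \sigma^r\,r!$

$\mu_1'$ (mean) $= \sigma$, $\mu_2' = 2\sigma^2$,

so that

$\mu_2 = \mu_2' - \mu_1'^2 = \sigma^2$.

Hence S. D. $= \sigma$.

Also $\mu_3' = 6\sigma^3$ and $\mu_4' = 24\sigma^4$,

so that

$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3 = 2\sigma^3$,

$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^2 - 3\mu_1'^4 = 9\sigma^4$.

∴ $\beta_1 = \dfrac{\mu_3^2}{\mu_2^3} = 4$, $\beta_2 = \dfrac{\mu_4}{\mu_2^2} = 9$.

$\dfrac{1}{\sigma}\int_0^{Q_1} e^{-x/\sigma}\,dx = \dfrac{1}{4}$, so that $Q_1 = \sigma\log_e\dfrac{4}{3}$,

$\dfrac{1}{\sigma}\int_0^{Q_3} e^{-x/\sigma}\,dx = \dfrac{3}{4}$, so that $Q_3 = \sigma\log_e 4$.

Interquartile range $= Q_3 - Q_1 = \sigma\log_e 3$.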
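
The quartiles follow from inverting the cumulative distribution $F(x) = 1 - e^{-x/\sigma}$. A short sketch, with the hypothetical helper `exponential_quantile` and an arbitrarily chosen value of σ:

```python
import math

def exponential_quantile(p, sigma):
    """Solve 1 - exp(-x / sigma) = p for x."""
    return -sigma * math.log(1.0 - p)

sigma = 2.0                                  # illustrative value only
q1 = exponential_quantile(0.25, sigma)       # sigma * log(4/3)
q3 = exponential_quantile(0.75, sigma)       # sigma * log 4
interquartile_range = q3 - q1
print(q1, q3, interquartile_range, sigma * math.log(3))
```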

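5. Show that for the distribution

$df = y_0\,e^{-x^2/2}\,x^{n-1}\,dx$, $0 \le x < \infty$,

$\mu_1'$ (about the origin) $= \sqrt{2}\,\dfrac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}$ and $\mu_2' = n$.

(M. Sc. Agra ’53)

For a probability distribution function

$y_0\int_0^\infty e^{-x^2/2}\,x^{n-1}\,dx = 1$, ...(1)

giving $y_0 = \dfrac{1}{2^{(n-2)/2}\,\Gamma\left(\frac{n}{2}\right)}$, since $\Gamma\left(\dfrac{n}{2}\right) = \int_0^\infty e^{-t}\,t^{\frac{n}{2}-1}\,dt$.

Putting $\tfrac{1}{2}x^2 = t$ in (1), we get the value of $y_0$.

Hence $\mu_r' = \dfrac{1}{2^{(n-2)/2}\,\Gamma\left(\frac{n}{2}\right)}\int_0^\infty e^{-x^2/2}\,x^{n+r-1}\,dx$

$= 2^{r/2}\,\dfrac{\Gamma\left(\frac{n+r}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}$.

Putting r = 1, 2, we get the required results.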

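6. Show that for the symmetrical distribution

$df = \dfrac{2a}{\pi}\cdot\dfrac{dx}{a^2+x^2}$, $-a \le x \le a$,

$\mu_2 = \dfrac{a^2(4-\pi)}{\pi}$ and $\mu_4 = a^4\left(1 - \dfrac{8}{3\pi}\right)$.

We have

$\mu_1'$ (about origin) $= \dfrac{2a}{\pi}\int_{-a}^{a}\dfrac{x}{a^2+x^2}\,dx = \dfrac{a}{\pi}\left[\log(a^2+x^2)\right]_{-a}^{a} = 0$.

$\mu_2 = \mu_2' = \dfrac{2a}{\pi}\int_{-a}^{a}\dfrac{x^2}{a^2+x^2}\,dx$

$= \dfrac{4a}{\pi}\left[x - a\tan^{-1}\dfrac{x}{a}\right]_0^a$

$= \dfrac{a^2(4-\pi)}{\pi}$.

$\mu_4 = \mu_4' = \dfrac{4a}{\pi}\int_0^a\dfrac{x^4}{a^2+x^2}\,dx = \dfrac{4a}{\pi}\int_0^a\left\{x^2 - a^2 + \dfrac{a^4}{a^2+x^2}\right\}dx$

$= \dfrac{4a}{\pi}\left[\dfrac{x^3}{3} - a^2x + a^3\tan^{-1}\dfrac{x}{a}\right]_0^a$

$= a^4\left(1 - \dfrac{8}{3\pi}\right)$.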


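7. A frequency distribution in the range (-3, 3) is defined by

$y = \tfrac{1}{16}(3+x)^2$, $-3 \le x \le -1$;

$y = \tfrac{1}{16}(6-2x^2)$, $-1 \le x \le 1$;

$y = \tfrac{1}{16}(3-x)^2$, $1 \le x \le 3$.

Find the mean and S. D. of the distribution.

(M. Sc. Agra ’48, B. Sc. Agra ’59)

$N_1 = \dfrac{1}{16}\int_{-3}^{-1}(3+x)^2\,dx = \dfrac{1}{6}$,

$N_2 = \dfrac{1}{16}\int_{-1}^{1}(6-2x^2)\,dx = \dfrac{2}{3}$,

$N_3 = \dfrac{1}{16}\int_{1}^{3}(3-x)^2\,dx = \dfrac{1}{6} = N_1$,

so that $N = N_1 + N_2 + N_3 = 1$.

$M_1 = \dfrac{1}{16}\int_{-3}^{-1} x(3+x)^2\,dx = -\dfrac{1}{16}\int_{1}^{3} z(3-z)^2\,dz$ (putting x = -z)

$= -\dfrac{1}{16}\int_{1}^{3} x(3-x)^2\,dx = -M_3$.

$M_2 = \dfrac{1}{16}\int_{-1}^{1} x(6-2x^2)\,dx = 0$.

Now $NM = M_1N_1 + M_2N_2 + M_3N_3$, which vanishes since $M_1 = -M_3$, $N_1 = N_3$ and $M_2 = 0$,

so that M = 0.

$\mu_2 = \dfrac{1}{16}\int_{-3}^{-1} x^2(3+x)^2\,dx + \dfrac{1}{16}\int_{-1}^{1}(6-2x^2)\,x^2\,dx + \dfrac{1}{16}\int_{1}^{3} x^2(3-x)^2\,dx$

$= \dfrac{2}{5} + \dfrac{1}{5} + \dfrac{2}{5} = 1$.

Hence standard deviation $= \sqrt{\mu_2} = 1$.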

8. For the continuous distribution

$dF = y_0\,(x - x^2)\,dx$, $0 \le x \le 1$,

find the arithmetic mean, the mode and the median.

(B. A. Hons. Delhi ’49, B. Sc. Agra ’60)

If the total frequency is unity,

$y_0\int_0^1 (x-x^2)\,dx = \dfrac{1}{6}y_0 = 1$ or $y_0 = 6$.

A. M. $= 6\int_0^1 x\,(x-x^2)\,dx = \dfrac{1}{2}$.

If the median is $M_d$, then

$6\int_0^{M_d}(x-x^2)\,dx = \dfrac{1}{2}$

or $4M_d^3 - 6M_d^2 + 1 = 0$

or $(2M_d - 1)(2M_d^2 - 2M_d - 1) = 0$

giving $M_d = \tfrac{1}{2}$; the other factor gives the roots $\tfrac{1}{2}(1 \pm \sqrt{3})$, both of which lie outside the range of the variable.

Now $y = 6(x-x^2)$,

$\dfrac{dy}{dx} = 6(1-2x) = 0$.

For the modal value $\dfrac{dy}{dx} = 0$ and $\dfrac{d^2y}{dx^2} = -12$ is negative, which gives $x = \tfrac{1}{2}$, lying within the range of the variable.

Hence the modal value $= \tfrac{1}{2}$. Since the mean, median and mode coincide at $x = \tfrac{1}{2}$, the distribution is symmetrical about $x = \tfrac{1}{2}$.
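
The median can also be located numerically from the cumulative distribution $F(x) = 3x^2 - 2x^3$. The sketch below uses bisection; the function names `cdf` and `median_by_bisection` and the tolerance are assumptions made for illustration.

```python
def cdf(x):
    """F(x) = 6 * integral of (t - t^2) from 0 to x = 3x^2 - 2x^3."""
    return 3 * x**2 - 2 * x**3

def median_by_bisection(lo=0.0, hi=1.0, tol=1e-12):
    """Find x in [lo, hi] with F(x) = 1/2; F is increasing on the range."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

median = median_by_bisection()
print(median)
```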




9. Supposing the life in hours of a certain kind of radio tube has the density $f(x) = \dfrac{100}{x^2}$ when $x \ge 100$ and zero when $x < 100$, what is the probability that none of three such tubes in a given radio set will have to be replaced during the first 150 hours of operation ? What is the probability that all three of the original tubes will have been replaced during the first 150 hours ?

(M. A. Punjab ’58, M. A. Delhi ’60)

$f(x) = \dfrac{100}{x^2}$, $x \ge 100$;

$f(x) = 0$, $x < 100$.

$\int_{100}^{\infty}\dfrac{100}{x^2}\,dx = 1$.

$\int_{100}^{150}\dfrac{100}{x^2}\,dx = \dfrac{1}{3}$.

The probability that one tube fails during the first 150 hours is therefore

$P(100 \le x \le 150) = \dfrac{1}{3}$.

(Note. The probability of failure during the first hundred hours is zero.)

The probability that one tube will not have to be replaced during the first 150 hours is $1 - \dfrac{1}{3} = \dfrac{2}{3}$.

Hence the probability that none of the three tubes shall have to be replaced $= \left(\dfrac{2}{3}\right)^3 = \dfrac{8}{27}$.

Hence also the probability that all three tubes fail during the first 150 hours $= \left(\dfrac{1}{3}\right)^3 = \dfrac{1}{27}$.

10. Show that for the continuous distribution with probability differential

$dp = \dfrac{1}{B(l, m)}\,x^{l-1}(1-x)^{m-1}\,dx$, $0 \le x \le 1$, $l, m > 0$,

$\mu_r'$ (about origin) $= \dfrac{l(l+1)\ldots(l+r-1)}{(l+m)(l+m+1)\ldots(l+m+r-1)}$.

Hence show that

$\mu_1' = \dfrac{l}{l+m}$, $\mu_2 = \dfrac{lm}{(l+m)^2(l+m+1)}$ and $H = \dfrac{l-1}{l+m-1}$.

Also find the mode of the distribution when both l, m are greater than unity.

The above distribution is known as the 'Beta distribution of the first kind' and briefly written as $\beta_1(l, m)$.

The total frequency

$N = \dfrac{1}{B(l, m)}\int_0^1 x^{l-1}(1-x)^{m-1}\,dx = 1$,

since $\int_0^1 x^{l-1}(1-x)^{m-1}\,dx = B(l, m)$.

Also

$\mu_r' = \dfrac{1}{B(l, m)}\int_0^1 x^{l+r-1}(1-x)^{m-1}\,dx$

$= \dfrac{B(l+r, m)}{B(l, m)}$

$= \dfrac{\Gamma(l+m)}{\Gamma(l)\,\Gamma(m)}\cdot\dfrac{\Gamma(l+r)\,\Gamma(m)}{\Gamma(l+m+r)}$

$= \dfrac{l(l+1)(l+2)\ldots(l+r-1)}{(l+m)(l+m+1)\ldots(l+m+r-1)}$.

In particular,

$\mu_1' = \dfrac{l}{l+m}$,

$\mu_2' = \dfrac{l(l+1)}{(l+m)(l+m+1)}$.

So

$\mu_2 = \mu_2' - (\mu_1')^2 = \dfrac{l(l+1)}{(l+m)(l+m+1)} - \dfrac{l^2}{(l+m)^2} = \dfrac{lm}{(l+m)^2(l+m+1)}$.

$\dfrac{1}{H} = \dfrac{1}{B(l, m)}\int_0^1 x^{l-2}(1-x)^{m-1}\,dx = \dfrac{B(l-1, m)}{B(l, m)}$, $l > 1$,

$= \dfrac{\Gamma(l+m)}{\Gamma(l)\,\Gamma(m)}\cdot\dfrac{\Gamma(l-1)\,\Gamma(m)}{\Gamma(l+m-1)} = \dfrac{l+m-1}{l-1}$.

Hence

$H = \dfrac{l-1}{l+m-1}$.

Now the mode of the distribution is at the point where, in the curve

$y = x^{l-1}(1-x)^{m-1}$,

$\dfrac{dy}{dx} = 0$ and $\dfrac{d^2y}{dx^2} < 0$.

$\dfrac{1}{y}\dfrac{dy}{dx} = \dfrac{l-1}{x} - \dfrac{m-1}{1-x}$

or

$\dfrac{dy}{dx} = x^{l-1}(1-x)^{m-1}\left\{\dfrac{(l-1) - x(l+m-2)}{x(1-x)}\right\}$

which equated to zero gives $x = \dfrac{l-1}{l+m-2}$.

It can be verified that this value of x satisfies the condition $\dfrac{d^2y}{dx^2} < 0$, when l > 1 and m > 1.

Hence the modal value is $\dfrac{l-1}{l+m-2}$, $l > 1$, $m > 1$.
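
The Beta-distribution formulas above can be spot-checked for particular values of l and m. This is a sketch only; `beta_fn` and `raw_moment` are hypothetical helper names, and the values l = 3, m = 2 are chosen arbitrarily.

```python
import math

def beta_fn(l, m):
    """B(l, m) expressed through Gamma functions."""
    return math.gamma(l) * math.gamma(m) / math.gamma(l + m)

def raw_moment(r, l, m):
    """mu_r' = B(l + r, m) / B(l, m); r = -1 gives 1/H when l > 1."""
    return beta_fn(l + r, m) / beta_fn(l, m)

l, m = 3.0, 2.0                               # illustrative parameters
mean = raw_moment(1, l, m)
variance = raw_moment(2, l, m) - mean**2
harmonic_mean = 1.0 / raw_moment(-1, l, m)
print(mean, variance, harmonic_mean)
```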

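11. Show that for $\beta_1(l, m)$,

$\log G = \dfrac{\partial}{\partial l}\{\log\Gamma(l) - \log\Gamma(l+m)\}$,

where G is the geometric mean, given by

$\log G = \dfrac{1}{B(l, m)}\int_0^1 \log x\cdot x^{l-1}(1-x)^{m-1}\,dx$.

Since $B(l, m) = \int_0^1 x^{l-1}(1-x)^{m-1}\,dx$, ...(1)

$\dfrac{\partial}{\partial l}B(l, m) = \int_0^1 \log x\cdot x^{l-1}(1-x)^{m-1}\,dx = B(l, m)\log G$.

∴ $\log G = \dfrac{1}{B(l, m)}\dfrac{\partial}{\partial l}B(l, m)$

$= \dfrac{\partial}{\partial l}\{\log B(l, m)\}$

$= \dfrac{\partial}{\partial l}\log\dfrac{\Gamma(l)\,\Gamma(m)}{\Gamma(l+m)}$

$= \dfrac{\partial}{\partial l}\{\log\Gamma(l) + \log\Gamma(m) - \log\Gamma(l+m)\}$

$= \dfrac{\partial}{\partial l}\{\log\Gamma(l) - \log\Gamma(l+m)\}$.

Note. The partial differentiation of both sides of (1) is mathematically justified, the discussion being beyond the scope of this book.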

12. Find the mean and standard deviation of the distribution with probability differential

$dp = \dfrac{1}{B(l, m)}\cdot\dfrac{x^{l-1}\,dx}{(1+x)^{l+m}}$, $0 \le x < \infty$, and $l, m > 0$.

Show that

$\mu_r'$ (about origin) $= \dfrac{l(l+1)(l+2)\ldots(l+r-1)}{(m-1)(m-2)\ldots(m-r)}$

and the harmonic mean $= \dfrac{l-1}{m}$.

The above distribution is briefly written as $\beta_2(l, m)$ and known as the Beta distribution of the second kind.

Consider the integral

$I = \int_0^\infty\dfrac{x^{l-1}}{(1+x)^{l+m}}\,dx$.

Putting $x = \dfrac{t}{1-t}$, so that $dx = \dfrac{dt}{(1-t)^2}$, we get

$I = \int_0^1\dfrac{t^{l-1}}{(1-t)^{l-1}}\,(1-t)^{l+m}\,\dfrac{dt}{(1-t)^2}$

$= \int_0^1 t^{l-1}(1-t)^{m-1}\,dt$

$= B(l, m)$.

Hence the total frequency is unity.

Now $\mu_r' = \dfrac{1}{B(l, m)}\int_0^\infty\dfrac{x^r\,x^{l-1}}{(1+x)^{l+m}}\,dx$

$= \dfrac{B(l+r, m-r)}{B(l, m)}$, $m > r$,

$= \dfrac{\Gamma(l+r)\,\Gamma(m-r)}{\Gamma(l+m)}\cdot\dfrac{\Gamma(l+m)}{\Gamma(l)\,\Gamma(m)}$

$= \dfrac{l(l+1)(l+2)\ldots(l+r-1)}{(m-1)(m-2)\ldots(m-r)}$.

In particular,

$\mu_1' = \dfrac{l}{m-1}$,

$\mu_2' = \dfrac{l(l+1)}{(m-1)(m-2)}$.

Hence the variance

$\mu_2 = \mu_2' - (\mu_1')^2 = \dfrac{l(l+1)}{(m-1)(m-2)} - \dfrac{l^2}{(m-1)^2} = \dfrac{l(l+m-1)}{(m-1)^2(m-2)}$.

The standard deviation

$= \sqrt{\left\{\dfrac{l(l+m-1)}{(m-1)^2(m-2)}\right\}}$.

The harmonic mean H is found as follows :

$\dfrac{1}{H} = \dfrac{1}{B(l, m)}\int_0^\infty\dfrac{1}{x}\cdot\dfrac{x^{l-1}\,dx}{(1+x)^{l+m}}$

$= \dfrac{B(l-1, m+1)}{B(l, m)}$, $l > 1$,

$= \dfrac{\Gamma(l-1)\,\Gamma(m+1)}{\Gamma(l+m)}\cdot\dfrac{\Gamma(l+m)}{\Gamma(l)\,\Gamma(m)} = \dfrac{m}{l-1}$.

Hence $H = \dfrac{l-1}{m}$.

If l > 1, there is a mode at

$x = \dfrac{l-1}{m+1}$,

which is left as an exercise for the students.

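13. For the incomplete Beta function defined as

$I_t(l, m) = \dfrac{1}{B(l, m)}\int_0^t x^{l-1}(1-x)^{m-1}\,dx$,

prove the relation

$I_t(l, m) = 1 - I_{1-t}(m, l)$.

Substituting x = 1 - y in the given integral,

$I_t(l, m) = \dfrac{1}{B(l, m)}\int_{1-t}^{1}(1-y)^{l-1}\,y^{m-1}\,dy$

$= \dfrac{1}{B(l, m)}\left[\int_0^1 (1-y)^{l-1}\,y^{m-1}\,dy - \int_0^{1-t}(1-y)^{l-1}\,y^{m-1}\,dy\right]$

$= 1 - I_{1-t}(m, l)$.

Note that $B(l, m) = B(m, l)$.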

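14. Show that for the Gamma distribution

$dp = \dfrac{1}{\Gamma(l)}\,e^{-x}\,x^{l-1}\,dx$, $0 \le x < \infty$,

the mean and variance are both equal to l. Find also the rth moment about zero and the harmonic mean.

The total frequency

$N = \dfrac{1}{\Gamma(l)}\int_0^\infty e^{-x}\,x^{l-1}\,dx = 1$

since $\Gamma(l) = \int_0^\infty e^{-x}\,x^{l-1}\,dx$.

Now

$\mu_1' = \dfrac{1}{\Gamma(l)}\int_0^\infty x\,e^{-x}\,x^{l-1}\,dx = \dfrac{\Gamma(l+1)}{\Gamma(l)} = l$.

Also

$\mu_2' = \dfrac{1}{\Gamma(l)}\int_0^\infty x^2\,e^{-x}\,x^{l-1}\,dx = \dfrac{\Gamma(l+2)}{\Gamma(l)} = l(l+1)$,

so that

$\mu_2 = \mu_2' - (\mu_1')^2 = l$.

Hence the variance is l and the standard deviation is $\sqrt{l}$.

$\mu_r' = \dfrac{1}{\Gamma(l)}\int_0^\infty x^r\,e^{-x}\,x^{l-1}\,dx = \dfrac{\Gamma(l+r)}{\Gamma(l)} = l(l+1)(l+2)\ldots(l+r-1)$.

It can be verified that if l > 1, the distribution has a mode at x = l - 1.

The harmonic mean H can be found as follows :

$\dfrac{1}{H} = \dfrac{1}{\Gamma(l)}\int_0^\infty\dfrac{1}{x}\,e^{-x}\,x^{l-1}\,dx = \dfrac{\Gamma(l-1)}{\Gamma(l)}$ if l > 1

$= \dfrac{1}{l-1}$.

Hence H = l - 1.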

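15. Show that for the chi-square distribution

$f(\chi^2)\,d(\chi^2) = \dfrac{1}{\Gamma\left(\frac{n}{2}\right)}\,e^{-\chi^2/2}\left(\tfrac{1}{2}\chi^2\right)^{\frac{n}{2}-1}d\left(\tfrac{1}{2}\chi^2\right)$,

$\mu_r'$ (about the origin) $= n(n+2)(n+4)\ldots(n+2r-2)$.

The above distribution is a particular case of the Gamma distribution. The total frequency can be easily found to be one.

$\mu_r' = \dfrac{1}{\Gamma\left(\frac{n}{2}\right)}\int_0^\infty (\chi^2)^r\,e^{-\chi^2/2}\left(\tfrac{1}{2}\chi^2\right)^{\frac{n}{2}-1}d\left(\tfrac{1}{2}\chi^2\right)$.

Putting $\tfrac{1}{2}\chi^2 = t$, we get

$\mu_r' = \dfrac{2^r}{\Gamma\left(\frac{n}{2}\right)}\int_0^\infty e^{-t}\,t^{\frac{n}{2}+r-1}\,dt$

$= 2^r\,\dfrac{\Gamma\left(\frac{n}{2}+r\right)}{\Gamma\left(\frac{n}{2}\right)} = 2^r\left(\dfrac{n}{2}+r-1\right)\left(\dfrac{n}{2}+r-2\right)\ldots\left(\dfrac{n}{2}\right)$

$= n(n+2)(n+4)\ldots(n+2r-2)$.

In particular, the mean $\mu_1' = n$

and $\mu_2' = n(n+2)$

so that $\mu_2 = \mu_2' - (\mu_1')^2 = 2n$.

Hence the S. D. $= \sqrt{(2n)}$.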

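16. Show that the mean value of the positive square root of a γ(l) variate is $\Gamma(l+\tfrac{1}{2})/\Gamma(l)$. Hence prove that the mean deviation of a normal variate from its mean is $\sigma\sqrt{\dfrac{2}{\pi}}$.

The Gamma probability density function is given by

$dp = \dfrac{1}{\Gamma(l)}\,e^{-x}\,x^{l-1}\,dx$, $0 \le x < \infty$. ...(1)

Now $E(\sqrt{x}) = \dfrac{1}{\Gamma(l)}\int_0^\infty\sqrt{x}\,e^{-x}\,x^{l-1}\,dx$ ...(2)

$= \Gamma(l+\tfrac{1}{2})/\Gamma(l)$.

Substituting $x = \dfrac{(u-m)^2}{2\sigma^2}$ and putting $l = \tfrac{1}{2}$ in (2), we get

$\dfrac{1}{\Gamma(\frac{1}{2})}\int_m^\infty e^{-\frac{(u-m)^2}{2\sigma^2}}\,\dfrac{(u-m)}{\sigma^2}\,du = \dfrac{\Gamma(1)}{\Gamma(\frac{1}{2})}$

or

$\dfrac{2}{\sigma\sqrt{(2\pi)}}\int_m^\infty (u-m)\,e^{-\frac{(u-m)^2}{2\sigma^2}}\,du = \sigma\sqrt{2}\,\dfrac{\Gamma(1)}{\Gamma(\frac{1}{2})} = \sigma\sqrt{\dfrac{2}{\pi}}$ ...(3)

since $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$.

But the L. H. S. of (3) is the mean deviation of the normal variate u about the mean m.

Hence the result.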

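17. Show that the value of $y_0$ when

$dF = y_0\,e^{-mx}\left(1+\dfrac{x}{a}\right)^{ma}dx$, $-a \le x < \infty$,

gives a probability density function is

$\dfrac{(ma)^{ma+1}}{a\,e^{ma}\,\Gamma(ma+1)}$.

Find also the mean and variance of the distribution.

The total frequency

$N = y_0\int_{-a}^{\infty} e^{-mx}\left(1+\dfrac{x}{a}\right)^{ma}dx$.

Putting $1+\dfrac{x}{a} = \dfrac{z}{ma}$, we get

$N = \dfrac{y_0\,e^{ma}}{m\,(ma)^{ma}}\int_0^\infty e^{-z}\,z^{ma}\,dz = \dfrac{y_0\,a\,e^{ma}\,\Gamma(ma+1)}{(ma)^{ma+1}}$.

If N = 1, we get $y_0 = \dfrac{(ma)^{ma+1}}{a\,e^{ma}\,\Gamma(ma+1)}$.

The rth moment about -a is

$\mu_r' = y_0\int_{-a}^{\infty}(x+a)^r\,e^{-mx}\left(1+\dfrac{x}{a}\right)^{ma}dx$

$= y_0\,a^r\int_{-a}^{\infty} e^{-mx}\left(1+\dfrac{x}{a}\right)^{ma+r}dx$

$= y_0\,a^r\,\dfrac{e^{ma}}{m\,(ma)^{ma+r}}\int_0^\infty e^{-z}\,z^{ma+r}\,dz$

$= y_0\,\dfrac{a^r\,e^{ma}\,\Gamma(ma+r+1)}{m\,(ma)^{ma+r}}$

$= \dfrac{\Gamma(ma+r+1)}{m^r\,\Gamma(ma+1)}$.

In particular,

$\mu_1' = \dfrac{\Gamma(ma+2)}{m\,\Gamma(ma+1)} = \dfrac{ma+1}{m}$,

$\mu_2' = \dfrac{\Gamma(ma+3)}{m^2\,\Gamma(ma+1)} = \dfrac{(ma+2)(ma+1)}{m^2}$,

so that

$\mu_2 = \dfrac{(ma+2)(ma+1)}{m^2} - \dfrac{(ma+1)^2}{m^2} = \dfrac{ma+1}{m^2}$.

Hence the mean (measured from the origin) $= \mu_1' - a = \dfrac{1}{m}$ and the variance $= \dfrac{ma+1}{m^2}$.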


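18. Find the mean, mode and median for the distribution

$df = \tfrac{1}{2}\sin x\,dx$, $0 \le x \le \pi$. (Agra M. Sc.)

$N = \dfrac{1}{2}\int_0^\pi\sin x\,dx = 1$.

$M = \dfrac{1}{2}\int_0^\pi x\sin x\,dx$

$= \dfrac{1}{2}\Big[-x\cos x\Big]_0^\pi + \dfrac{1}{2}\int_0^\pi\cos x\,dx$

$= \dfrac{\pi}{2}$.

To find the mode,

$y = \tfrac{1}{2}\sin x$,

$\dfrac{dy}{dx} = \tfrac{1}{2}\cos x = 0$.

Hence the modal value is $\dfrac{\pi}{2}$, at which $\dfrac{dy}{dx} = 0$ and $\dfrac{d^2y}{dx^2}$ is negative.

Also if the median is denoted by $M_d$, we have

$\dfrac{1}{2}\int_0^{M_d}\sin x\,dx = \dfrac{1}{2}$,

$-\dfrac{1}{2}\Big[\cos x\Big]_0^{M_d} = \dfrac{1}{2}$,

giving

$\cos M_d = 0$ or $M_d = \dfrac{\pi}{2}$.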


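19. Find $y_0$, $\mu_2$ and $\mu_3$ for the distribution

$dF = y_0\left(1+\dfrac{ax}{2}\right)^{\frac{4}{a^2}-1}e^{-\frac{2x}{a}}\,dx$, $-\dfrac{2}{a} \le x < \infty$.

Since $y_0\int_{-2/a}^{\infty}\left(1+\dfrac{ax}{2}\right)^{\frac{4}{a^2}-1}e^{-\frac{2x}{a}}\,dx = N$,

putting $\beta = \dfrac{2}{a}$, we have $y_0\int_{-\beta}^{\infty}\left(1+\dfrac{x}{\beta}\right)^{\beta^2-1}e^{-\beta x}\,dx = N$.

Now putting $1+\dfrac{x}{\beta} = \dfrac{t}{\beta^2}$, $dx = \dfrac{dt}{\beta}$,

$N = y_0\int_0^\infty\dfrac{t^{\beta^2-1}}{\beta^{2\beta^2-2}}\,e^{-t}\,e^{\beta^2}\,\dfrac{dt}{\beta} = y_0\,\dfrac{e^{\beta^2}\,\Gamma(\beta^2)}{\beta^{2\beta^2-1}}$.

For a probability density function N = 1, so that

$y_0 = \dfrac{\beta^{2\beta^2-1}}{e^{\beta^2}\,\Gamma(\beta^2)}$ where $\beta = \dfrac{2}{a}$.

Now $\mu_r'$ (about $-\beta$) $= \dfrac{\beta^{2\beta^2-1}}{e^{\beta^2}\,\Gamma(\beta^2)}\int_{-\beta}^{\infty}(x+\beta)^r\left(1+\dfrac{x}{\beta}\right)^{\beta^2-1}e^{-\beta x}\,dx$

$= \dfrac{\beta^{2\beta^2-1}}{\Gamma(\beta^2)}\cdot\dfrac{\beta^r}{\beta^{2\beta^2+2r-1}}\int_0^\infty e^{-t}\,t^{\beta^2+r-1}\,dt$

$= \dfrac{1}{\Gamma(\beta^2)}\cdot\dfrac{\Gamma(\beta^2+r)}{\beta^r}$

so that

$\mu_1' = \dfrac{\Gamma(\beta^2+1)}{\Gamma(\beta^2)\,\beta} = \beta$,

$\mu_2' = \beta^2+1$, $\mu_3' = \dfrac{(\beta^2+2)(\beta^2+1)}{\beta}$.

Mean $M = -\beta + \mu_1' = -\beta + \beta = 0$,

$\mu_2 = \mu_2' - (\mu_1')^2 = (\beta^2+1) - \beta^2 = 1$,

$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3$

$= \dfrac{(\beta^2+2)(\beta^2+1) - 3(\beta^2+1)\beta^2 + 2\beta^4}{\beta}$

$= \dfrac{2}{\beta}$.

Hence Mean = 0, $\mu_2 = 1$, $\mu_3 = \dfrac{2}{\beta} = a$.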

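20. Show that if the curve

$y = y_0\,e^{-m/x}\,x^{-p}$, $0 \le x < \infty$,

represents a probability density function, $y_0$ is given by $\dfrac{m^{p-1}}{\Gamma(p-1)}$.

Prove also that the nth moment about the origin is given by

$\mu_n' = \dfrac{m^n\,\Gamma(p-n-1)}{\Gamma(p-1)}$

and hence

Mean $= \dfrac{m}{p-2}$, $\mu_2 = \dfrac{m^2}{(p-2)^2(p-3)}$, $\mu_3 = \dfrac{4m^3}{(p-2)^3(p-3)(p-4)}$.

$N = y_0\int_0^\infty x^{-p}\,e^{-m/x}\,dx$.

Putting $\dfrac{m}{x} = z$, we have

$N = m^{1-p}\,y_0\int_0^\infty e^{-z}\,z^{p-2}\,dz = y_0\,m^{1-p}\,\Gamma(p-1) = 1$

so that

$y_0 = \dfrac{m^{p-1}}{\Gamma(p-1)}$.

Also $\mu_n'$ (about origin) $= \dfrac{m^{p-1}}{\Gamma(p-1)}\int_0^\infty x^n\,e^{-m/x}\,x^{-p}\,dx$

$= \dfrac{m^{p-1}}{\Gamma(p-1)}\int_0^\infty e^{-m/x}\,x^{-(p-n)}\,dx$

$= \dfrac{m^{n+1-p}\,\Gamma(p-n-1)}{m^{1-p}\,\Gamma(p-1)} = \dfrac{m^n\,\Gamma(p-n-1)}{\Gamma(p-1)}$.

Mean $= \mu_1' = \dfrac{m\,\Gamma(p-2)}{\Gamma(p-1)} = \dfrac{m}{p-2}$,

$\mu_2' = \dfrac{m^2\,\Gamma(p-3)}{\Gamma(p-1)} = \dfrac{m^2}{(p-2)(p-3)}$,

$\mu_3' = \dfrac{m^3\,\Gamma(p-4)}{\Gamma(p-1)} = \dfrac{m^3}{(p-2)(p-3)(p-4)}$,

giving

$\mu_2 = \mu_2' - (\mu_1')^2 = \dfrac{m^2}{(p-2)^2(p-3)}$,

$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3 = \dfrac{4m^3}{(p-2)^3(p-3)(p-4)}$.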

21. Through a point B on the y-axis whose ordinate is positive and equal to a, a straight line is drawn in a direction taken at random in the interval $-\dfrac{\pi}{4} \le \theta \le \dfrac{\pi}{4}$, θ being the inclination of the line to BO. Examine the probability distribution of the intercept x on the x-axis.

If θ is assumed to be uniformly distributed in the interval $-\dfrac{\pi}{4}$ to $\dfrac{\pi}{4}$, the probability that a value of θ falls between $\theta - \tfrac{1}{2}d\theta$ and $\theta + \tfrac{1}{2}d\theta$ is $\dfrac{d\theta}{\pi/2}$. But the intercept on the x-axis is

$x = a\tan\theta$

or

$\theta = \tan^{-1}\dfrac{x}{a}$, so that $d\theta = \dfrac{a}{x^2+a^2}\,dx$.

But when θ falls within the interval dθ, x falls within dx.

Hence the probability distribution of x is

$df = \phi(x)\,dx = \dfrac{2a}{\pi(a^2+x^2)}\,dx$, $-a \le x \le a$.

(See Question No. 6, Page 161.)

22. A point P is taken at random in a line of length 2a, all
positions of the point being equally likely. Show that the expected
value of the area of the rectangle AP.PB is $\dfrac{2a^2}{3}$ and that the
probability of the area exceeding $\tfrac12 a^2$ is $\dfrac{1}{\sqrt2}$.
(B. Sc. Agra '56, '58)

We interpret that x (= AP) is uniformly distributed on the line AB.
Hence the probability of a value of x lying in the interval dx is
$\dfrac{dx}{2a}$, i.e. the probability distribution of x is

$$df = \frac{dx}{2a}, \qquad 0 \le x \le 2a.$$

The value of the area AP.PB = x (2a - x).

$$E(\text{Area AP.PB}) = \frac{1}{2a} \int_0^{2a} x (2a - x)\, dx = \frac{2a^2}{3}.$$

The area AP.PB will be greater than $\tfrac12 a^2$ when

$$x (2a - x) > \tfrac12 a^2$$

or $2x^2 - 4ax + a^2 < 0$

or $\left\{x - \left(a - \dfrac{a}{\sqrt2}\right)\right\} \left\{x - \left(a + \dfrac{a}{\sqrt2}\right)\right\} < 0$,

i.e. when $a - \dfrac{a}{\sqrt2} < x < a + \dfrac{a}{\sqrt2}$.

Hence the required probability

$$= \frac{1}{2a} \int_{a - a/\sqrt2}^{a + a/\sqrt2} dx = \frac{1}{\sqrt2}.$$
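The two results of Question 22 can be checked numerically. The following is only a sketch: the function names `expected_area` and `prob_area_exceeds_half_a_squared` are illustrative and not part of the text, and the expectation is approximated by a simple midpoint rule.

```python
import math


def expected_area(a: float, steps: int = 100_000) -> float:
    """Midpoint-rule estimate of E[x(2a - x)] for x uniform on (0, 2a)."""
    h = 2 * a / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x * (2 * a - x)
    # Multiply by the step and by the uniform density 1/(2a).
    return total * h / (2 * a)


def prob_area_exceeds_half_a_squared(a: float) -> float:
    """Length of the interval where x(2a - x) > a^2/2, divided by 2a."""
    lo = a - a / math.sqrt(2)
    hi = a + a / math.sqrt(2)
    return (hi - lo) / (2 * a)
```

For any positive a the first value should agree with 2a²/3 and the second with 1/√2, independently of a.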

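23. From a point on the circumference of a circle of radius
a, a chord is drawn in a random direction (all directions are equally
likely). Show that the expected value of the length of the chord is
$\dfrac{4a}{\pi}$ and that the variance of the length is
$2a^2 \left(1 - \dfrac{8}{\pi^2}\right)$. Also show that the chance is
$\tfrac13$ that the length of the chord will exceed the
length of the side of an equilateral triangle inscribed in the circle.

(B. Sc. Agra '60, M. A. Punjab '59)

Let $\theta$ be the angle which the chord AP makes with the diameter
through A. $\theta$ ranges from $-\pi/2$ to $\pi/2$. Hence the
probability that a value of $\theta$ lies within the interval $d\theta$
is $\dfrac{d\theta}{\pi}$.

Now AP = x = 2a cos $\theta$

or $\theta = \cos^{-1} \dfrac{x}{2a}$, so that $d\theta = -\dfrac{dx}{\sqrt{4a^2 - x^2}}$.

Since the values $\pm\theta$ give the same x, the probability
distribution of x is

$$df = \frac{2}{\pi} \cdot \frac{dx}{\sqrt{4a^2 - x^2}}, \qquad 0 \le x \le 2a.$$

$$E(x) = \frac{2}{\pi} \int_0^{2a} \frac{x\, dx}{\sqrt{4a^2 - x^2}}.$$

Changing the integral by putting x = 2a cos $\theta$,

$$E(x) = \frac{2}{\pi} \int_0^{\pi/2} 2a \cos\theta\, d\theta = \frac{4a}{\pi}.$$

$$\mu_2' \text{ (about origin)} = \frac{1}{\pi} \int_{-\pi/2}^{\pi/2} (2a \cos\theta)^2\, d\theta
= \frac{8a^2}{\pi} \int_0^{\pi/2} \cos^2\theta\, d\theta = 2a^2.$$

Hence $\mu_2 = \mu_2' - (\mu_1')^2 = 2a^2 \left(1 - \dfrac{8}{\pi^2}\right)$.

The length of the side of the equilateral triangle inscribed in
a circle of radius a is $a\sqrt3$.

Hence the chord exceeds this side when $2a \cos\theta > a\sqrt3$

or $\cos\theta > \dfrac{\sqrt3}{2}$

or $-\dfrac{\pi}{6} < \theta < \dfrac{\pi}{6}$.

The required probability $= \dfrac{\pi/3}{\pi} = \dfrac13$.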
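As a rough check of Question 23, the direction $\theta$ can be stepped uniformly over $(-\pi/2, \pi/2)$ and the chord length $2a\cos\theta$ averaged. This is a sketch only; the function name `chord_stats` and the step count are assumptions made for illustration.

```python
import math


def chord_stats(a: float, steps: int = 200_000):
    """Average the chord 2a*cos(theta) over theta uniform on (-pi/2, pi/2).

    Returns (mean length, variance of length, fraction of chords longer
    than the side a*sqrt(3) of the inscribed equilateral triangle).
    """
    h = math.pi / steps
    m1 = m2 = longer = 0.0
    for i in range(steps):
        theta = -math.pi / 2 + (i + 0.5) * h  # midpoint of each sub-interval
        x = 2 * a * math.cos(theta)
        m1 += x
        m2 += x * x
        if x > a * math.sqrt(3):
            longer += 1
    mean = m1 / steps
    var = m2 / steps - mean ** 2
    return mean, var, longer / steps
```

The three returned values should be close to 4a/π, 2a²(1 - 8/π²) and 1/3 respectively.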


24. A chord of a circle of radius a is drawn parallel to a
given straight line, all distances from the centre of the circle being
equally likely. Show that the expected value of the length of the
chord is $\tfrac12 \pi a$ and that the variance of the length is
$\dfrac{a^2}{12} (32 - 3\pi^2)$.
Also show that the chance is $\tfrac12$ that the length of the chord will
exceed the length of the side of an equilateral triangle inscribed in the
circle.

If the distance of the chord from the centre is taken to be x,
the probability distribution of x is

$$df = \frac{dx}{2a}, \qquad -a \le x \le a,$$

the length of the chord being $2\sqrt{a^2 - x^2}$.

Expected value of the length $= E\left[2\sqrt{a^2 - x^2}\right]$

$$= \int_{-a}^{a} 2\sqrt{a^2 - x^2}\, \frac{dx}{2a}
= \frac{2}{a} \int_0^a \sqrt{a^2 - x^2}\, dx = \frac{\pi a}{2}.$$

$$E\left[4 (a^2 - x^2)\right] = \int_{-a}^{a} 4 (a^2 - x^2)\, \frac{dx}{2a} = \frac{8a^2}{3}.$$

And variance of length $= \dfrac{8a^2}{3} - \dfrac{\pi^2 a^2}{4} = \dfrac{a^2}{12} (32 - 3\pi^2)$.

If $2\sqrt{a^2 - x^2} > a\sqrt3$ = side of the equilateral triangle
inscribed in the circle,

i.e. $4 (a^2 - x^2) > 3a^2$ or $x^2 < \dfrac{a^2}{4}$,

i.e. $-\dfrac{a}{2} < x < \dfrac{a}{2}$.

The probability $= \displaystyle\int_{-a/2}^{a/2} \frac{dx}{2a} = \frac12$.



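25. Show that for a set of values of a continuous variable the
sum of the absolute values of the deviations from the median is
minimum.

Let the probability density function be $\phi(x)\, dx$ where
$-\infty < x < \infty$, the median being at x = 0, so that

$$\int_{-\infty}^{0} \phi(x)\, dx = \int_0^\infty \phi(x)\, dx = \tfrac12. \qquad ...(1)$$

The integral S (h) of the absolute values of deviations from
x = h is

$$S(h) = \int_{-\infty}^{\infty} |x - h|\, \phi(x)\, dx
= \int_{-\infty}^{h} (h - x)\, \phi(x)\, dx + \int_h^\infty (x - h)\, \phi(x)\, dx, \qquad h > 0$$

$$= \left[\int_{-\infty}^{0} (h - x)\, \phi(x)\, dx + \int_0^h (h - x)\, \phi(x)\, dx\right]
+ \left[\int_0^\infty (x - h)\, \phi(x)\, dx - \int_0^h (x - h)\, \phi(x)\, dx\right].$$

$$S(0) = \int_{-\infty}^{0} (-x)\, \phi(x)\, dx + \int_0^\infty x\, \phi(x)\, dx.$$

Hence

$$S(h) - S(0) = h \left[\int_{-\infty}^{0} \phi(x)\, dx - \int_0^\infty \phi(x)\, dx\right]
+ 2 \int_0^h (h - x)\, \phi(x)\, dx,$$

the expression within the first brackets being zero by (1).

Since $\phi(x)$ is a positive function and h - x is positive between
0 and h, the R. H. S. is positive and hence S (h) > S (0). The
same result can be proved when h < 0. In that case

$$S(h) = \left[\int_{-\infty}^{0} (h - x)\, \phi(x)\, dx + \int_h^0 (x - h)\, \phi(x)\, dx\right]
+ \left[\int_h^0 (x - h)\, \phi(x)\, dx + \int_0^\infty (x - h)\, \phi(x)\, dx\right],$$

$$S(0) = \int_{-\infty}^{0} (-x)\, \phi(x)\, dx + \int_0^\infty x\, \phi(x)\, dx,$$

$$S(h) - S(0) = h \left[\int_{-\infty}^{0} \phi(x)\, dx - \int_0^\infty \phi(x)\, dx\right]
+ 2 \int_h^0 (x - h)\, \phi(x)\, dx$$

$$= 2 \int_h^0 (x - h)\, \phi(x)\, dx \quad \text{from (1)}.$$

Since h < 0, x - h is positive between h and 0, and the R. H. S. is
positive. Hence S (h) is least when h = 0.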


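26. Show that

$$\int_0^x e^{-x^2}\, dx = x e^{-x^2} \left[1 + \frac13 (2x^2) + \frac{1}{3 \cdot 5} (2x^2)^2 + \ldots\right].$$

(Agra M. Sc. '55)

Integrating by parts,

$$\int_0^x e^{-x^2}\, dx = \left[x e^{-x^2}\right]_0^x + 2 \int_0^x x^2 e^{-x^2}\, dx$$

$$= x e^{-x^2} + 2 \cdot \tfrac13\, x^3 e^{-x^2} + \frac{2 \cdot 2}{3} \int_0^x x^4 e^{-x^2}\, dx$$

$$= x e^{-x^2} + \frac{2x^3}{3} e^{-x^2} + \frac{2^2 x^5}{3 \cdot 5} e^{-x^2} + \frac{2^3}{3 \cdot 5} \int_0^x x^6 e^{-x^2}\, dx$$

$$= x e^{-x^2} \left[1 + \frac13 (2x^2) + \frac{1}{3 \cdot 5} (2x^2)^2 + \frac{1}{3 \cdot 5 \cdot 7} (2x^2)^3 + \ldots\right].$$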


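27. If $F(x) = \dfrac{1}{\sqrt{2\pi}} \displaystyle\int_{-\infty}^{x} e^{-\frac12 x^2}\, dx$, show that

(i) $F(x) = \dfrac12 + \dfrac{1}{\sqrt{2\pi}} \left[x - \dfrac{x^3}{2 \cdot 3} + \dfrac{x^5}{2!\, 2^2 \cdot 5} - \dfrac{x^7}{3!\, 2^3 \cdot 7} + \ldots\right]$ if x is small,

(ii) $F(x) = 1 - \dfrac{e^{-\frac12 x^2}}{x \sqrt{2\pi}} \left[1 - \dfrac{1}{x^2} + \dfrac{1 \cdot 3}{x^4} - \ldots\right]$ if x is large.

(Vikram B. A. '61)

(i) When x is small,

$$F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{0} e^{-\frac12 x^2}\, dx + \frac{1}{\sqrt{2\pi}} \int_0^x e^{-\frac12 x^2}\, dx$$

$$= \frac12 + \frac{1}{\sqrt{2\pi}} \int_0^x \left\{1 - \tfrac12 x^2 + \frac{x^4}{2!\, 2^2} - \frac{x^6}{3!\, 2^3} + \ldots\right\} dx$$

$$= \frac12 + \frac{1}{\sqrt{2\pi}} \left[x - \frac{x^3}{2 \cdot 3} + \frac{x^5}{2!\, 2^2 \cdot 5} - \frac{x^7}{3!\, 2^3 \cdot 7} + \ldots\right].$$

(ii) $$F(x) = 1 - \frac{1}{\sqrt{2\pi}} \int_x^\infty e^{-\frac12 x^2}\, dx
= 1 - \frac{1}{\sqrt{2\pi}} \int_x^\infty \frac1x \cdot x e^{-\frac12 x^2}\, dx.$$

Integrating by parts, treating $x e^{-\frac12 x^2}$ as one factor, we get

$$F(x) = 1 - \frac{1}{\sqrt{2\pi}} \left[\left\{-\frac1x e^{-\frac12 x^2}\right\}_x^\infty - \int_x^\infty \frac{1}{x^2} e^{-\frac12 x^2}\, dx\right]$$

$$= 1 - \frac{1}{\sqrt{2\pi}} \left[\frac{e^{-\frac12 x^2}}{x} - \int_x^\infty \frac{1}{x^3} \cdot x e^{-\frac12 x^2}\, dx\right]$$

and, repeating the process,

$$F(x) = 1 - \frac{e^{-\frac12 x^2}}{x \sqrt{2\pi}} \left\{1 - \frac{1}{x^2} + \frac{1 \cdot 3}{x^4} - \ldots\right\} \quad \text{when x is large}.$$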
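Both expansions of Question 27 can be compared with the exact value $F(x) = \tfrac12\{1 + \operatorname{erf}(x/\sqrt2)\}$. The sketch below is illustrative only; the function names and the choice of truncation points are assumptions, and the asymptotic series in (ii) is divergent, so only its first few terms should be used.

```python
import math


def normal_cdf_exact(x: float) -> float:
    """F(x) expressed through the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def normal_cdf_series(x: float, terms: int = 30) -> float:
    """Part (i): power series, suitable for small x."""
    total = 0.0
    for k in range(terms):
        total += (-1) ** k * x ** (2 * k + 1) / (
            math.factorial(k) * 2 ** k * (2 * k + 1)
        )
    return 0.5 + total / math.sqrt(2 * math.pi)


def normal_cdf_asymptotic(x: float, terms: int = 4) -> float:
    """Part (ii): first few terms of the expansion for large x."""
    bracket, term = 0.0, 1.0
    for k in range(terms):
        bracket += term
        term *= -(2 * k + 1) / (x * x)  # 1, -1/x^2, 1.3/x^4, -1.3.5/x^6, ...
    return 1.0 - math.exp(-x * x / 2) / (x * math.sqrt(2 * math.pi)) * bracket
```

For instance, the series may be tried at x = 0.5 and the asymptotic form at x = 5; both should agree closely with `normal_cdf_exact`.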

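28. If $\phi(hx) = \dfrac{2}{\sqrt\pi} \displaystyle\int_0^x e^{-h^2 x^2}\, h\, dx$, where $\phi(y) = \dfrac{2}{\sqrt\pi} \displaystyle\int_0^y e^{-y^2}\, dy$,

show that $\phi(y) = 1 - \dfrac{e^{-y^2}}{y \sqrt\pi} \left[1 - \dfrac{1}{2y^2} + \dfrac{1 \cdot 3}{(2y^2)^2} - \dfrac{1 \cdot 3 \cdot 5}{(2y^2)^3} + \ldots\right]$.

(Agra M. Sc. '57, B. Sc. '56)

See the previous question part (ii).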

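29. Defining the harmonic mean (H. M.) of a variate x as
the reciprocal of the expected value of 1/x, show that the H. M. of
the variate which ranges from 0 to $\infty$ with probability density
$\dfrac{x^n e^{-x}}{n!}$ is n, given that n is positive.

(Agra B. Sc. '58)

If H is the harmonic mean,

$$\frac1H = \int \frac1x\, \phi(x)\, dx.$$

In this case

$$\frac1H = \frac{1}{n!} \int_0^\infty \frac1x\, x^n e^{-x}\, dx = \frac{1}{n!} \int_0^\infty x^{n-1} e^{-x}\, dx
= \frac{\Gamma(n)}{n!} = \frac1n,$$

so that H = n.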


30. Two points are selected at random in a line AC of length
a, so as to lie on opposite sides of its middle point O. Find
the probability that the distance between them is less than a/3.

(Agra B. Sc. '60)

Let A be the origin and the points be D and E, with distances
$x_1$ and $x_2$ respectively from the origin, so that

$$0 < x_1 < \frac a2, \qquad \frac a2 < x_2 < a \qquad \text{and} \qquad x_2 - x_1 < \frac a3.$$

Assuming that all positions of D are equally likely between
A and O and those of E are equally likely between O and C,
their probability distributions are respectively

$$df_1 = \frac{dx_1}{a/2}, \qquad df_2 = \frac{dx_2}{a/2}.$$

The conditions on $x_2$ can be satisfied only if $x_1 + a/3 > a/2$, i.e.
$x_1 > a/6$. The joint probability

$$= \int_{a/6}^{a/2} \int_{a/2}^{x_1 + a/3} \frac{dx_1\, dx_2}{(a/2)(a/2)}
= \frac{4}{a^2} \int_{a/6}^{a/2} \left(x_1 - \frac a6\right) dx_1 = \frac29.$$

Note. Since $x_1$ and $x_2$ are independent variates, their joint
distribution is given by $df_1 \cdot df_2$.


31. A straight line of length a is divided into two parts. Find
the mean value of the rectangle contained by the two parts.

Let the length of the line be a and that of one part be x, so
that the length of the other part is a - x. The rectangle contained
by the two parts is x (a - x). Also x can have values between 0
and a, all values being taken as equally likely, so that
$df = \dfrac{dx}{a}$.

Hence the mean $= \dfrac1a \displaystyle\int_0^a x (a - x)\, dx = \dfrac{a^2}{6}$.



32. The sides of a rectangle are taken at random each less
than an inch and all lengths are equally likely. Find the chance
that the diagonal is less than an inch.

If the sides of the rectangle are taken as x and y, we have

0 < x < 1,

0 < y < 1.

Also $\sqrt{x^2 + y^2} < 1$

or $x^2 + y^2 < 1$ or $y < \sqrt{1 - x^2}$.

If all lengths of the sides between 0 and one inch are equally
likely, the probability distributions are

$df_1 = dx$,
$df_2 = dy$.

The joint probability

$$= \int_0^1 \int_0^{\sqrt{1 - x^2}} dx\, dy = \int_0^1 \sqrt{1 - x^2}\, dx.$$

Substituting x = sin $\theta$ and integrating, we get the joint
probability $= \dfrac{\pi}{4}$.

Note. Since x and y are independent variates, the joint
distribution of x and y is

$df_1 \cdot df_2 = dx\, dy$.

33. There are two clerks in an office; each of them goes out
for an hour for lunch. One may start any time between 12 and 1
o'clock, the other at any time between 1 and 2. All times within
these limits are equally likely. Find the chance that they are not
out together.

Suppose the first clerk goes out at any time x after 12 and the
other at any time y after 12, with y later than 1.

Then 0 < x < 1,

1 < y < 2.

If both clerks are out together for any time,

y - x < 1

or y < x + 1.

The joint probability

$$= \int_0^1 \int_1^{1 + x} dx\, dy = \int_0^1 x\, dx = \frac12.$$

Hence the probability that both are not out together $= 1 - \tfrac12 = \tfrac12$.

34. Two points are selected at random on a line of length a.
What is the probability that none of the three sections into which
the line is thus divided is less than $\dfrac a4$?

Let AB be the given line divided into four equal parts AC, CD,
DE and EB by the points C, D and E, the length of each part
being $\tfrac14 a$. Now none of the points P or Q can be in AC or EB.

Let P be at a distance x from A. Then

$\tfrac14 a < x < \tfrac12 a$.

Similarly if y is the distance AQ,

$\tfrac14 a + x < y < \tfrac34 a$.

Since x and y are each uniformly distributed over (0, a), and either
of the two points may be the one nearer to A, the required probability

$$= \frac{2}{a^2} \int_{a/4}^{a/2} \int_{\frac14 a + x}^{\frac34 a} dx\, dy
= \frac{2}{a^2} \int_{a/4}^{a/2} \left(\frac a2 - x\right) dx = \frac{1}{16}.$$


35. A bombing plane carrying three bombs flies directly above
a railroad track. If a bomb falls within 40 ft. of the track, the
track will be sufficiently damaged to disrupt traffic. With a certain
bomb sight, the density of points of impact of a bomb is

$$f(x) = \frac{100 + x}{10{,}000}, \qquad -100 \le x \le 0,$$

$$\phantom{f(x)} = \frac{100 - x}{10{,}000}, \qquad 0 \le x \le 100,$$

$$\phantom{f(x)} = 0 \qquad \text{elsewhere}.$$

x represents the vertical deviation from the aiming point, which
is the track in this case. If all three bombs are used, what is the
probability that the track will be damaged?

The track will be damaged if a bomb falls within ±40 ft.
of the track. According to the given probability distribution,
the probability is

$$P(-40 < x < 40) = \int_{-40}^{0} \frac{100 + x}{10{,}000}\, dx + \int_0^{40} \frac{100 - x}{10{,}000}\, dx$$

$$= \left[\frac{x}{100} + \frac{x^2}{20{,}000}\right]_{-40}^{0} + \left[\frac{x}{100} - \frac{x^2}{20{,}000}\right]_0^{40}$$

$$= \frac45 - \frac{4}{25} = \frac{16}{25}.$$

The probability that the track will not be damaged by one
bomb $= \tfrac{9}{25}$. Hence the probability that none of the bombs falls
within ±40 ft. of the track is $\left(\tfrac{9}{25}\right)^3$.

The probability that the track will be damaged

$$= 1 - \left(\tfrac{9}{25}\right)^3 = 1 - \cdot046656 = \cdot953344.$$
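The arithmetic of Question 35 can be reproduced exactly with rational numbers. This is a minimal sketch; the function names `hit_probability` and `damage_probability` are chosen here for illustration only.

```python
from fractions import Fraction


def hit_probability(limit: int = 40) -> Fraction:
    """P(-limit < x < limit) for the triangular density on (-100, 100)."""
    # Integral of (100 - x)/10000 from 0 to limit; doubled by symmetry.
    one_side = Fraction(100 * limit, 10_000) - Fraction(limit * limit, 20_000)
    return 2 * one_side


def damage_probability(bombs: int = 3, limit: int = 40) -> Fraction:
    """Probability that at least one of the bombs falls within the limit."""
    miss = 1 - hit_probability(limit)
    return 1 - miss ** bombs
```

With the default arguments the first function gives 16/25 and the second 14896/15625, i.e. .953344.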

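36. Find the mean of the square of the distance of a point
within a given square of side 2a from the centre of the square.

If any point within the square is taken with coordinates (x, y),
the origin being taken at the centre and two perpendicular axes
parallel to the sides of the square as the axes of reference, the
(distance)$^2$ of (x, y) from the centre is $x^2 + y^2$.

The distribution of (x, y) within the square is $\dfrac{dx\, dy}{4a^2}$.

Hence the required mean value

$$= \frac{1}{4a^2} \int_{-a}^{a} \int_{-a}^{a} (x^2 + y^2)\, dx\, dy
= \frac{1}{4a^2} \int_{-a}^{a} \left[x^2 y + \tfrac13 y^3\right]_{-a}^{a} dx$$

$$= \frac{1}{4a^2} \int_{-a}^{a} \left(2a x^2 + \tfrac23 a^3\right) dx
= \frac{1}{4a^2} \left[\tfrac23 a x^3 + \tfrac23 a^3 x\right]_{-a}^{a} = \frac{2a^2}{3}.$$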


37. A country filling station is supplied with gasoline once a
week. If its weekly volume x of sales in thousands of gallons is
distributed by $f(x) = 5 (1 - x)^4$, 0 < x < 1, what must be the
capacity of its tank in order that the probability that its supply will
be exhausted in a given week is .01?

Let the tank capacity be y thousand gallons. Then the
probability that at least y thousand gallons are sold in a week should
be .01, i.e. the probability of a sale less than y is 1 - .01 = .99.

$$\int_0^y 5 (1 - x)^4\, dx = \cdot99$$

or $-\left[(1 - x)^5\right]_0^y = \cdot99$

or $(1 - y)^5 = \cdot01$

or $1 - y = \cdot39811$

giving $y = \cdot60189$,

so that the tank capacity is 602 gallons nearly.
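The root extraction in Question 37 is easily checked. The sketch below uses an illustrative function name, `tank_capacity`, and simply solves $(1 - y)^5 = p$ for y.

```python
def tank_capacity(p_exhaust: float = 0.01) -> float:
    """Capacity y (thousands of gallons) with P(sales >= y) = p_exhaust.

    For the density 5(1 - x)^4 on (0, 1), P(sales >= y) = (1 - y)^5.
    """
    return 1.0 - p_exhaust ** (1 / 5)
```

For p_exhaust = .01 this gives y = .60189 approximately, i.e. about 602 gallons.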

Exercises 

1. (a) Is the function defined as follows a density function?

f (x) = 0, x < 2,

= $\tfrac{1}{18}$ (3 + 2x), 2 ≤ x ≤ 4,

= 0, x > 4.

(b) Find the probability that a variate having this density
will fall in the interval 2 ≤ x ≤ 3. [Yes; $\tfrac49$]

2. Show that for the rectangular distribution 

df=dx y 0 < x < 1 , 

^l' = a 


3. Show that in order that the frequency function

$f(x) = c\, x^{5/2} (1 - x)^{3/2}$, 0 ≤ x ≤ 1,
may be a probability density function,









Also show that M 2 ="eV- 

4. The distribution of F for n and n' degrees of freedom is
given to be

$$dP = y_0\, \frac{F^{(\frac12 n - 1)}\, dF}{(nF + n')^{\frac12 (n + n')}}, \qquad 0 \le F < \infty.$$

Determine the constant $y_0$ and show that the mean of the
distribution is $\dfrac{n'}{n' - 2}$.

(Delhi B. A. Hons. '56)

Find the mean deviation, standard deviation and skewness of 
the distribution given by 

/(x) = 2* (2 — x) 0 < x < 2. 

(Delhi B. A. Hons. ’58) 


6. Prove that the geometric mean, G, of the distribution
df = 6 (2 - x) (x - 1) dx, 1 ≤ x ≤ 2
is given by 6 log (16G) = 19. (Delhi B. A. Hons. '58)


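7. Find $\mu_1'$, $\mu_2'$, $\mu_3'$, $\mu_2$, $\mu_3$, $\mu_4$, $\beta_1$ and $\beta_2$ for the continuous
distribution $df = y_0\, x (2 - x)\, dx$, 0 ≤ x ≤ 2.

(Punjab B. A. 1961, Agra B. Sc. 1960)
Ans. $y_0 = \tfrac34$, $\mu_1' = 1$, $\mu_2' = \tfrac65$, $\mu_2 = \tfrac15$,
$\mu_3' = \tfrac85$, $\mu_3 = 0$, $\mu_4 = \tfrac{3}{35}$, $\beta_1 = 0$, $\beta_2 = \tfrac{15}{7}$.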

8. If the function f (x) is defined by

$f(x) = c\, e^{-x}$, 0 ≤ x < ∞,

find the value of c which changes f (x) to a probability density
function and hence evaluate the first four moments about the
mean. Ans. c = 1, $\mu_2 = 1$, $\mu_3 = 2$, $\mu_4 = 9$.

9. Defining the harmonic mean (H. M.) of a variate x as the
reciprocal of the expected value of $\dfrac1x$, show that the H. M. of
the variate which ranges from 0 to ∞ with probability
density $\dfrac{x^n e^{-x}}{n!}$ is n, given that n is positive.

(Agra B. Sc. Part II 1958)

[Hint. See Ex. 14 Page 167.]

10. A machine makes bolts with diameters distributed by the
density $f(x) = K (x - \cdot24)^2 (x - \cdot26)^2$
when .24 ≤ x ≤ .26 and zero otherwise. Find K which
will change the above function into a probability density
function. Ans. $\tfrac{15}{16} \times 10^{10}$.

11. Show that, for the distribution

$df = y_0\, dx$, | x - b | ≤ a,

$y_0 = \dfrac{1}{2a}$, mean = b and variance $= \dfrac{a^2}{3}$.

Variate x has the density 


s/ 


2 _ 


xe 


lx 


, x 


0 . 


Find its mean and variance. 

Ans. Mean= k /M 


%/©[ 


13. For the distribution

$dP = y_0\, e^{-|x|}\, dx$, -∞ < x < ∞,
show that $y_0 = \tfrac12$, $\mu_1' = 0$, $\sigma = \sqrt2$, mean deviation about
mean = 1.


CHAPTER VIII 


IMPORTANT THEORETICAL DISTRIBUTIONS 

8.1. Theoretical Distributions. The frequency distributions
described in the previous chapters refer to the samples drawn
from the population. When the values of the variate in the
population are distributed according to some law which can be
expressed mathematically, such distributions are known as theoretical
distributions, as distinguished from the frequency distributions.
The most important theoretical distributions are the Binomial (due to
James Bernoulli), the Poisson (due to S. D. Poisson) and the
Normal (due to De Moivre, Laplace and Gauss). The first two
are discrete distributions and the third a continuous one.

8.2. Binomial Distribution. If we denote the happening of
an event by a success, the probability of which is p, and that of its
non-happening, i.e. failure, by q, and a number of independent
trials are performed, we would like to know the probabilities of
no success, one success, two successes and so on, p and q remaining
unchanged in all the trials. When a coin is tossed, the probability
of the head falling uppermost is p (say) and of the tail falling
uppermost is q, so that p + q = 1. Now suppose the coin is tossed ten
times; we may be interested in calculating the probabilities of heads
falling uppermost 0 times, 1 time, two times, ..., ten times. For
a symmetrical coin it is evident that p = q = ½. It may be noted
that the trials are independent, i.e. the fact that one trial has
resulted in head falling uppermost does not affect the probability
of head or tail falling uppermost in subsequent trials. Moreover,
the values of p and q remain invariant in all the trials. The
outcome of a single toss is either a head or a tail, which we denote by
H and T respectively. The second trial may result in a head or
a tail and the outcome of the two tosses can be represented as

HH or HT or TH or TT.

Similarly the outcomes of three tosses can be

HHH or HHT or HTH or THH or HTT or THT or TTH or TTT.

If we disregard the order of the falls of heads or tails, we have
one case each of heads or tails all the three times and three cases each
of (i) twice heads and once tail, (ii) twice tails and once head.
The probability of heads all the three times is $p^3$, of tails all the
three times is $q^3$; the probability of twice heads and once tail in a
specified order is $p^2 q$ and since it can happen in three different
mutually exclusive orders, the total probability of twice heads and
once tail irrespective of the order is $3p^2 q$. Similarly in three
tosses of a single coin, the probability of twice tails and once
head is $3q^2 p$.

Now suppose out of n independent trials, the heads fall x
times and tails n - x times. One of the orders can be

HHH ... H (x times) TTT ... T (n - x times),

the probability of this event being $p^x q^{n-x}$. Now the falling of
heads x times out of n trials can take place in ${}^nC_x$ different ways
if we consider all the possible orders in which an event results in
success x times and in failure n - x times out of n trials, and
hence the probability of x heads and n - x tails irrespective
of the order is

$${}^nC_x\, p^x q^{n-x} \qquad ...(1)$$

which is the (x + 1)th term in the binomial expansion

$$(q + p)^n \qquad ...(2)$$

where q + p = 1. Consequently the probabilities of 0, 1, 2, 3, ..., n
heads in n tosses of a single coin are respectively

$$q^n,\ {}^nC_1\, p q^{n-1},\ {}^nC_2\, p^2 q^{n-2},\ {}^nC_3\, p^3 q^{n-3},\ \ldots,\ {}^nC_n\, p^n.$$

The distribution

$$f(x) = {}^nC_x\, p^x q^{n-x}, \qquad x = 0, 1, 2, \ldots, n \qquad ...(3)$$

determines a probability distribution, since the total frequency is
unity; it is known as Bernoulli's distribution or the Binomial
distribution.

The results of the above example of the toss of a coin can
be applied to a set of trials of any other event which satisfies the
conditions: (i) the trials are independent, (ii) the probability of
success remains constant in all the trials.

If there be N sets each of n trials, the frequencies of 0, 1, 2, ..., n
successes are given by the terms of the expansion

$$N (q + p)^n,$$

the frequency of x successes being

$$N \cdot {}^nC_x\, p^x q^{n-x},$$

x being a positive integer ≤ n or zero.
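The terms of formula (3) can be tabulated directly. The following is a minimal sketch; the function names `binomial_pmf` and `expected_frequencies` are illustrative and not taken from the text.

```python
from math import comb


def binomial_pmf(n: int, p: float) -> list[float]:
    """Probabilities of 0, 1, ..., n successes: terms of (q + p)^n."""
    q = 1.0 - p
    return [comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]


def expected_frequencies(N: int, n: int, p: float) -> list[float]:
    """Frequencies of 0, 1, ..., n successes in N sets of n trials."""
    return [N * f for f in binomial_pmf(n, p)]
```

For three tosses of a symmetrical coin, `binomial_pmf(3, 0.5)` gives the probabilities 1/8, 3/8, 3/8, 1/8 found above.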


8.3. Pascal's Triangle. The coefficients of the binomial
distribution are given by the triangle

    1   1
    1   2   1
    1   3   3   1
    1   4   6   4   1
    1   5  10  10   5   1

in which the numbers in a row, say the third, are formed in the
following manner. Make the first term of the second row the first
term of the third row; add the first term of the second row to the
second term and put the sum in the second place of the third row and
so on, and finally make the last term in the second row the last
term of the third row. The first row gives the coefficients when
n = 1, the second when n = 2, the third when n = 3 and so on.
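The rule for forming each row from the one above it can be sketched as follows. The function name `pascal_rows` is an assumption made for illustration; the first row returned corresponds to n = 1, as in the triangle of the text.

```python
def pascal_rows(n_max: int) -> list[list[int]]:
    """Rows of binomial coefficients for n = 1, 2, ..., n_max."""
    rows = [[1, 1]]  # coefficients for n = 1
    while len(rows) < n_max:
        prev = rows[-1]
        # Each inner term is the sum of two neighbouring terms of the row above.
        middle = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + middle + [1])
    return rows
```

Calling `pascal_rows(5)` reproduces the five rows shown, the last being 1, 5, 10, 10, 5, 1.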

8.4. Moments of the Binomial Distribution. (B. Sc. Agra '61)
Taking our origin at 0 success, the successive deviations from the
origin are 0, 1, 2, ..., n and the relative frequencies are given by (3).
Thus we get

$$\mu_1' = 0 \cdot q^n + 1 \cdot {}^nC_1\, q^{n-1} p + 2 \cdot {}^nC_2\, q^{n-2} p^2 + \ldots + n p^n$$

$$= np \left\{q^{n-1} + {}^{n-1}C_1\, q^{n-2} p + {}^{n-1}C_2\, q^{n-3} p^2 + \ldots + p^{n-1}\right\}$$

$$= np\, (q + p)^{n-1}$$

$$= np \quad \text{since } q + p = 1. \qquad ...(1)$$

Hence mean = np.

$$\mu_2' = \sum_{x=0}^{n} {}^nC_x\, p^x q^{n-x}\, x^2
= \sum_{x=0}^{n} {}^nC_x\, p^x q^{n-x} \{x (x - 1) + x\}$$

$$= n (n - 1)\, p^2 \sum_{x=2}^{n} {}^{n-2}C_{x-2}\, q^{n-x} p^{x-2}
+ np \sum_{x=1}^{n} {}^{n-1}C_{x-1}\, q^{n-x} p^{x-1}$$

$$= n (n - 1)\, p^2 (q + p)^{n-2} + np\, (q + p)^{n-1}$$

$$= n (n - 1)\, p^2 + np. \qquad ...(2)$$

Also

$$\mu_3' = \sum_{x=0}^{n} {}^nC_x\, q^{n-x} p^x\, x^3
= \sum_{x=0}^{n} {}^nC_x\, q^{n-x} p^x \{x (x - 1)(x - 2) + 3x (x - 1) + x\}$$

$$= n (n - 1)(n - 2)\, p^3 (q + p)^{n-3} + 3n (n - 1)\, p^2 (q + p)^{n-2} + np\, (q + p)^{n-1}$$

$$= n (n - 1)(n - 2)\, p^3 + 3n (n - 1)\, p^2 + np. \qquad ...(3)$$

Similarly writing $x^4$ as

$$x (x - 1)(x - 2)(x - 3) + 6x (x - 1)(x - 2) + 7x (x - 1) + x,$$

we get

$$\mu_4' = n (n - 1)(n - 2)(n - 3)\, p^4 + 6n (n - 1)(n - 2)\, p^3 + 7n (n - 1)\, p^2 + np. \qquad ...(4)$$

The variance

$$\mu_2 = \mu_2' - \mu_1'^2 = n (n - 1)\, p^2 + np - n^2 p^2 = np (1 - p) = npq.$$

Hence the S. D. $\sigma = \sqrt{npq}$.

$$\mu_3 = \mu_3' - 3\mu_2' \mu_1' + 2\mu_1'^3 = npq\, (q - p)$$

and

$$\mu_4 = \mu_4' - 4\mu_3' \mu_1' + 6\mu_2' \mu_1'^2 - 3\mu_1'^4 = 3p^2 q^2 n^2 + pqn\, (1 - 6pq).$$

Hence

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(q - p)^2}{npq}, \qquad
\beta_2 = \frac{\mu_4}{\mu_2^2} = 3 + \frac{1 - 6pq}{npq},$$

$$\gamma_1 = \frac{q - p}{\sqrt{npq}}, \qquad \gamma_2 = \frac{1 - 6pq}{npq}.$$

It can be easily seen that the mean of the binomial distribution
is greater than the variance, since q < 1.
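The closed forms for the central moments can be verified by direct summation over the distribution. This is only a sketch under the assumption that p is supplied as an exact fraction; the function name `binomial_central_moments` is illustrative.

```python
from fractions import Fraction
from math import comb


def binomial_central_moments(n: int, p: Fraction):
    """Return (mean, mu_2, mu_3, mu_4) of (q + p)^n by direct summation."""
    q = 1 - p
    pmf = [comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]
    mean = sum(x * f for x, f in enumerate(pmf))

    def mu(r: int) -> Fraction:
        # r-th moment about the mean
        return sum((x - mean) ** r * f for x, f in enumerate(pmf))

    return mean, mu(2), mu(3), mu(4)
```

With, say, n = 7 and p = 1/3, the four values should coincide exactly with np, npq, npq(q - p) and 3(npq)² + npq(1 - 6pq).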

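8.5. Show that the kth moment $M_k$ about the origin of the
binomial distribution of degree n is given by

$$M_k = \left(p \frac{\partial}{\partial p}\right)^k (p + q)^n.$$

Hence or otherwise, obtain the coefficient $\beta_1$ of the binomial
distribution, deducing the formula you use. (Agra M. Sc. '49)

We have

$$(p + q)^n = \sum_{r=0}^{n} {}^nC_r\, p^r q^{n-r}.$$

Differentiating with respect to p, treating q as independent of p
(and putting p + q = 1 after the differentiation),

$$\frac{\partial}{\partial p} (p + q)^n = \sum_{r=0}^{n} {}^nC_r\, p^{r-1} q^{n-r} \cdot r,$$

$$p \frac{\partial}{\partial p} (p + q)^n = \sum_{r=0}^{n} {}^nC_r\, p^r q^{n-r} \cdot r = M_1 = np,$$

$$\frac{\partial}{\partial p} \left\{p \frac{\partial}{\partial p} (p + q)^n\right\} = \sum_{r=0}^{n} {}^nC_r\, p^{r-1} q^{n-r} \cdot r^2,$$

$$p \frac{\partial}{\partial p} \left\{p \frac{\partial}{\partial p} (p + q)^n\right\} = \sum_{r=0}^{n} {}^nC_r\, p^r q^{n-r} \cdot r^2 = M_2$$

$$= n (n - 1)\, p^2 + np. \qquad \text{[See (2) of § 8.4]}$$

Now let us suppose that the formula is true for a value k, i.e.

$$M_k = \left(p \frac{\partial}{\partial p}\right)^k (p + q)^n = \sum_{r=0}^{n} {}^nC_r\, p^r q^{n-r} \cdot r^k;$$

then

$$\frac{\partial}{\partial p} \left\{\left(p \frac{\partial}{\partial p}\right)^k (p + q)^n\right\} = \sum_{r=0}^{n} {}^nC_r\, r\, p^{r-1} q^{n-r} \cdot r^k,$$

$$p \frac{\partial}{\partial p} \left\{\left(p \frac{\partial}{\partial p}\right)^k (p + q)^n\right\} = \sum_{r=0}^{n} {}^nC_r\, p^r q^{n-r} \cdot r^{k+1} = M_{k+1}.$$

Thus by mathematical induction, we see that if the formula
holds for a value k, it also holds for k + 1. We have already seen
that it is true for the second moment; hence it is proved to be true for
the 3rd, 4th, 5th, ... and all positive integral values of k.

Now $\beta_1$ can be obtained as in § 8.4.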

8.6. If a coin with probability p for head is tossed n times,
find the number x of heads with maximum probability.

(M. Sc. Agra '62, Delhi '49, '56)

We have seen in § 8.2 that the relative frequencies of 0, 1, 2, ..., n
heads are given by the successive terms of $(q + p)^n$.

$$f(r) = {}^nC_r\, p^r q^{n-r},$$

$$f(r + 1) = {}^nC_{r+1}\, p^{r+1} q^{n-r-1},$$

$$\frac{f(r + 1)}{f(r)} = \frac{n - r}{r + 1} \cdot \frac pq.$$

Now f (r + 1) > f (r), so long as

(n - r) p > (r + 1) q

or r < np - q.

If the integral part of np - q is x - 1, then x gives the number
of heads with maximum probability. If np - q is an integer,
then the probabilities of np - q and np - q + 1 successes shall be
equal and maximum.
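The rule of § 8.6 can be compared with a direct search for the largest term. The sketch below assumes 0 < p < 1 and that p is given as an exact fraction; the names `binomial_modes` and `modes_by_search` are illustrative only.

```python
from fractions import Fraction
from math import comb, floor


def binomial_modes(n: int, p: Fraction) -> list[int]:
    """Most probable number(s) of successes from the rule r < np - q."""
    q = 1 - p
    t = n * p - q
    if t.denominator == 1:  # np - q is an integer: two equal maxima
        return [int(t), int(t) + 1]
    return [floor(t) + 1]


def modes_by_search(n: int, p: Fraction) -> list[int]:
    """Locate the largest term(s) of (q + p)^n by exhaustive comparison."""
    q = 1 - p
    pmf = [comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]
    top = max(pmf)
    return [x for x, f in enumerate(pmf) if f == top]
```

For n = 5 and p = ½ both functions return the pair 2 and 3, since np - q = 2 is an integer.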


8.7. For the binomial distribution, prove the following relation:

$$\mu_{r+1} = pq \left(nr\, \mu_{r-1} + \frac{d\mu_r}{dp}\right).$$

(M. A. Delhi '58, Madras '60)

We have,

$$\mu_r = \sum_{x=0}^{n} f(x)\, (x - \mu_1')^r = \sum_{x=0}^{n} {}^nC_x\, p^x q^{n-x} (x - np)^r,$$

the total frequency being taken to be unity.

$$\frac{d\mu_r}{dp} = \sum_{x=1}^{n} {}^nC_x\, p^{x-1} q^{n-x} (x - np)^r\, x
- \sum_{x=0}^{n} {}^nC_x\, p^x q^{n-x-1} (x - np)^r (n - x)
- \sum_{x=0}^{n} {}^nC_x\, p^x q^{n-x} (x - np)^{r-1}\, nr$$

by putting q = 1 - p, which gives $\dfrac{dq}{dp} = -1$.

Hence

$$\frac{d\mu_r}{dp} = \sum_{x} {}^nC_x\, p^{x-1} q^{n-x-1} (x - np)^r (xq - np + xp) - nr\, \mu_{r-1},$$

$$pq\, \frac{d\mu_r}{dp} = \sum_{x} {}^nC_x\, p^x q^{n-x} (x - np)^{r+1} - pq\, nr\, \mu_{r-1}
= \mu_{r+1} - pq\, nr\, \mu_{r-1}$$

or

$$\mu_{r+1} = pq \left\{nr\, \mu_{r-1} + \frac{d\mu_r}{dp}\right\}.$$

This recursion formula is due to Romanovsky.

8.8. Solved Examples.


1. Show that, if np be a whole number, the mean of the binomial
distribution coincides with the greatest term.

See § 8.6, the mean of the binomial distribution being np.

2. Show that if two symmetrical distributions (p = q = ½) of
degree n (the same number of observations) are so superposed that
the rth term of one coincides with the (r + 1)th term of the other,
the distribution formed by adding superposed terms is a symmetrical
binomial of degree n + 1. (Agra M. Sc. '47, '49, '53, '57, '61)

The successive terms of the binomial distribution are

$$N (\tfrac12)^n,\ N\, {}^nC_1 (\tfrac12)^n,\ N\, {}^nC_2 (\tfrac12)^n,\ \ldots,\ N\, {}^nC_r (\tfrac12)^n,\ \ldots,\ N (\tfrac12)^n.$$

The rth term of the first distribution $= N\, {}^nC_{r-1} (\tfrac12)^n$.

The (r + 1)th term of the second distribution $= N\, {}^nC_r (\tfrac12)^n$.

On addition, $N (\tfrac12)^n \{{}^nC_{r-1} + {}^nC_r\} = N (\tfrac12)^n \cdot {}^{n+1}C_r$.

The new distribution is $2N (\tfrac12 + \tfrac12)^{n+1}$, the (r + 1)th term
being $2N\, {}^{n+1}C_r (\tfrac12)^{n+1}$ of a binomial distribution of (n + 1)th degree
with total frequency 2N.

3. A biased coin was tossed 5 times and the whole experiment
was repeated 200 times. The following frequencies of 0, 1, 2, ..., 5
heads were obtained.

No. of heads    0    1    2    3    4    5    Total
Frequency      12   56   74   39   18    1     200

Fit a binomial distribution to the data.

The mean $= \dfrac{\Sigma f x}{\Sigma f}$

$$= \frac{0 \times 12 + 1 \times 56 + 2 \times 74 + 3 \times 39 + 4 \times 18 + 5 \times 1}{200} = 1\cdot99$$

or np = 1.99
or 5p = 1.99
giving p = 0.398
and q = 0.602.

The theoretical frequencies for 0, 1, 2, ..., 5 heads are

$$200 \times (0\cdot602)^5,\ 200 \times {}^5C_1 \times (0\cdot602)^4 \times 0\cdot398,\ \ldots,\ 200 \times (0\cdot398)^5$$

or 15.814, 52.272, 69.118, 45.694, 15.106, 1.996 respectively.
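The fitting procedure of Example 3 (estimate p from the sample mean, then expand $N (q + p)^n$) can be sketched as below. The function name `fit_binomial` is an assumption for illustration; the observed list is indexed by the number of successes.

```python
from math import comb


def fit_binomial(observed: list[int]) -> tuple[float, list[float]]:
    """Estimate p from the mean and return (p, theoretical frequencies)."""
    n = len(observed) - 1            # number of trials in each set
    N = sum(observed)                # number of sets
    mean = sum(x * f for x, f in enumerate(observed)) / N
    p = mean / n                     # since mean = np
    q = 1 - p
    expected = [N * comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]
    return p, expected
```

Applied to the frequencies 12, 56, 74, 39, 18, 1 it gives p = 0.398 and theoretical frequencies close to those quoted above; small differences in the last decimal place come from rounding.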

4. A variable takes values 0, 1, 2, ..., n with frequencies
proportional to the binomial coefficients

$$1,\ {}^nC_1,\ {}^nC_2,\ \ldots,\ {}^nC_n;$$

find the mean and the variance of the distribution and show that
the variance is half of the mean. (U. P. P. C. S. '59)

The total frequency is

$$1 + {}^nC_1 + {}^nC_2 + \ldots + {}^nC_n = (1 + 1)^n = 2^n.$$

Hence the probabilities for 0, 1, 2, ..., n values of the variate
are respectively

$$\frac{1}{2^n},\ \frac{{}^nC_1}{2^n},\ \frac{{}^nC_2}{2^n},\ \ldots,\ \frac{{}^nC_n}{2^n}.$$

This is a binomial distribution $(q + p)^n$ in which q = p = ½.

Mean = np = ½n,

Variance = npq = ¼n.

Hence we see that the variance is half of the mean.

5. Calculate the value of p if the ratio of the probability
of an event happening exactly r times in n trials to the probability
of the event happening exactly n - r times in n trials is independent
of n (0 < p < 1).

The probability of the event happening r times in n trials

$$= {}^nC_r\, p^r (1 - p)^{n-r}.$$

Similarly the probability of the event happening n - r times
out of n trials $= {}^nC_{n-r}\, p^{n-r} (1 - p)^r$.

$$\text{The ratio} = \frac{{}^nC_r\, p^r (1 - p)^{n-r}}{{}^nC_{n-r}\, p^{n-r} (1 - p)^r} = \left(\frac{1 - p}{p}\right)^{n - 2r}.$$

It can be seen that the only value for the term within the
bracket which would make this ratio independent of n is unity.

Hence $\dfrac{1 - p}{p} = 1$

or 1 - p = p

or p = ½.

6. Bring out the fallacy, if any, in the statement:

The mean of a binomial distribution is 5 and S. D. is 3.

(Delhi B. A. Hons. '52)

We have np = 5,

$\sqrt{npq} = 3$ or npq = 9,

giving $q = \tfrac95$, i.e. q > 1,

which is wrong since q is a probability which must be less than one.

7. Assuming that half the population are consumers of rice,
so that the chance of an individual being a consumer is ½, and
assuming that 100 investigators each take ten individuals to see whether
they are consumers, how many investigators do you expect to
report that three people or less are consumers?

(Delhi M. A. Eco. Stat. '58, Punjab M. A. '52)

Here we have 100 sets of 10 trials each, the probability of
success being ½. The numbers of investigators reporting
0, 1, 2, ..., 10 consumers are given by the successive terms of
the binomial

$$100 (\tfrac12 + \tfrac12)^{10}.$$

The number reporting three or less consumers shall be the
sum of the first four terms

$$= 100 \left\{(\tfrac12)^{10} + 10 (\tfrac12)^9 (\tfrac12) + 45 (\tfrac12)^8 (\tfrac12)^2 + 120 (\tfrac12)^7 (\tfrac12)^3\right\}$$

$$= \frac{100}{2^{10}} \times 176 = \frac{275}{16}$$

= 17 nearly.
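The sum of the first four terms in Example 7 can be confirmed exactly. The function name `expected_reports_at_most` and its default arguments are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def expected_reports_at_most(
    k: int, n: int = 10, investigators: int = 100, p: Fraction = Fraction(1, 2)
) -> Fraction:
    """Expected number of investigators reporting k or fewer consumers."""
    q = 1 - p
    return investigators * sum(
        comb(n, x) * p ** x * q ** (n - x) for x in range(k + 1)
    )
```

`expected_reports_at_most(3)` evaluates to 275/16, i.e. about 17.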

8. The following data are the number of seeds germinating
out of 10 on damp filter paper for 80 sets of seeds. Fit a binomial
distribution to these data:

x    0    1    2    3    4    5    6    7    8    9   10
f    6   20   28   12    8    6    0    0    0    0    0

(Agra B. Sc. Old Scheme '55)

The mean $= \dfrac{\Sigma f x}{\Sigma f} = \dfrac{20 + 56 + 36 + 32 + 30}{80} = \dfrac{87}{40}$.

Now the mean of a binomial distribution is np.

$$np = 10p = \frac{87}{40}$$

giving $p = \dfrac{87}{400} = 0\cdot2175$,

q = 1 - p = 0.7825.

The expected values of f are given by the successive terms of

$$80 (\cdot7825 + \cdot2175)^{10}.$$

Thus the theoretical frequencies are

x    0     1     2     3     4     5    6    7    8    9   10
f   6.9  19.1  24.0  17.8  8.6   2.9   .7   .1    0    0    0

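9. If a coin is tossed n times, where n is a large even number,
show that the probability of exactly ½n - x heads and ½n + x tails is
approximately

$$\left(\frac{2}{\pi n}\right)^{1/2} e^{-2x^2/n}.$$

(Agra M. Sc. '47, '51, '58, B. Sc. '55)

As in § 8.2, we have

$$f(x) = {}^nC_x\, p^x q^{n-x},$$

$$f(x + 1) = {}^nC_{x+1}\, p^{x+1} q^{n-x-1},$$

$$\frac{f(x + 1)}{f(x)} = \frac{n - x}{x + 1} \cdot \frac pq.$$

Since n is given to be an even number, let n = 2k, where k is
a positive integer.

f (x + 1) > f (x)

so long as (n - x) p > (x + 1) q

or x < k - ½, since p = q = ½.

Hence f (k) is the greatest term in the distribution.

Now

$$f(k) = \frac{(2k)!}{k!\, k!} \left(\tfrac12\right)^{2k}, \qquad
f(k + x) = \frac{(2k)!}{(k + x)!\, (k - x)!} \left(\tfrac12\right)^{2k}$$

or

$$\frac{f(k + x)}{f(k)} = \frac{(k!)(k!)}{(k + x)!\, (k - x)!}.$$

According to Stirling's theorem, if n is very large,
approximately

$$n! = \sqrt{2n\pi}\; n^n e^{-n} = \sqrt{2\pi}\; n^{n + \frac12} e^{-n}.$$

Applying Stirling's approximation, we have

$$\frac{f(k + x)}{f(k)} = \frac{2\pi\, k^{2k+1} e^{-2k}}{2\pi\, (k + x)^{k + x + \frac12} (k - x)^{k - x + \frac12} e^{-2k}}
= \frac{1}{\left(1 + \dfrac xk\right)^{k + x + \frac12} \left(1 - \dfrac xk\right)^{k - x + \frac12}}$$

or

$$\log_e \frac{f(k + x)}{f(k)} = -\left(k + x + \tfrac12\right) \log_e \left(1 + \frac xk\right) - \left(k - x + \tfrac12\right) \log_e \left(1 - \frac xk\right)$$

$$= -\left(k + x + \tfrac12\right) \left(\frac xk - \frac{x^2}{2k^2} + \ldots\right) + \left(k - x + \tfrac12\right) \left(\frac xk + \frac{x^2}{2k^2} + \ldots\right)$$

$$= -\frac{x^2}{k} \quad \text{approximately, as k is a large quantity,}$$

or

$$f(k + x) = f(k)\, e^{-x^2/k} = f(k)\, e^{-2x^2/n} \quad \text{as } k = \tfrac12 n.$$

By symmetry (p = q) the same value holds for f (k - x). Also

$$f(k) = {}^nC_{n/2} \left(\tfrac12\right)^n = \frac{n!}{(n/2)!\, (n/2)!} \left(\tfrac12\right)^n
= \frac{\sqrt{2\pi}\; n^{n + \frac12} e^{-n}}{2\pi \left(\tfrac12 n\right)^{n+1} e^{-n}} \left(\tfrac12\right)^n
= \left(\frac{2}{\pi n}\right)^{1/2}.$$

Hence the required probability

$$= \left(\frac{2}{\pi n}\right)^{1/2} e^{-2x^2/n}.$$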
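The quality of the approximation in Example 9 can be examined for a particular n. The sketch below is illustrative; the names `exact_probability` and `approx_probability` are assumptions, and n is taken to be even.

```python
import math


def exact_probability(n: int, x: int) -> float:
    """Exact probability of n/2 - x heads in n tosses of a fair coin."""
    return math.comb(n, n // 2 - x) / 2 ** n


def approx_probability(n: int, x: int) -> float:
    """Approximation (2/(pi n))^(1/2) * exp(-2 x^2 / n)."""
    return math.sqrt(2 / (math.pi * n)) * math.exp(-2 * x * x / n)
```

For n = 1000 and x = 10 the two values differ by well under one per cent, and the agreement improves as n increases.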

8.9. Poisson Distribution.

Define a Poisson random variable and give some physical situations
illustrating it. Find out its mean and variance.

(Agra M. Sc. '62, B. Sc. Agra '59)

Or

Derive Poisson's exponential limit

$$e^{-m} \left(1 + m + \frac{m^2}{2!} + \frac{m^3}{3!} + \ldots\right)$$

for the binomial distribution $(q + p)^n$. Discuss carefully the
attendant conditions. (Agra M. Sc. '57, '58)

A variable x is said to have the Poisson distribution if it takes
the values 0, 1, 2, 3, ... inf. with probabilities given by

$$e^{-m},\ \frac{m e^{-m}}{1!},\ \frac{m^2 e^{-m}}{2!},\ \ldots$$

respectively.

The Poisson distribution is generally derived as the limiting
form of the binomial distribution $(q + p)^n$ when n → ∞ and
p → 0 so that np is a finite quantity, say m. We come across cases
in which the probability of success is very small. Such events are
known as rare events. The following are some of the physical
situations illustrating the Poisson distribution.

(1) Number of deaths from a disease (not in the form of an
epidemic) such as heart attack.

(2) Number of printing mistakes per page of one of the
early proofs of a book.

(3) The number of defective articles per packing manufactured
by a good concern.

(4) The number of articles of a certain merchandise sold by
a concern in time t.

(5) Number of telephone calls received at a particular
switch board per minute.

(6) Number of fly bombs falling on London during World
War II in a certain area.

(7) Number of cars passing a crossing per minute.

In a binomial distribution

$$f(x) = {}^nC_x\, p^x q^{n-x}
= \frac{n (n - 1)(n - 2) \ldots (n - x + 1)}{x!}\, p^x (1 - p)^{n-x}$$

$$= \frac{\left(1 - \dfrac1n\right) \left(1 - \dfrac2n\right) \ldots \left(1 - \dfrac{x - 1}{n}\right)}{x!}\, (np)^x (1 - p)^{n-x}.$$

Taking limits as n → ∞ and p → 0 so that np = m,

$$f(x) = \frac{m^x e^{-m}}{x!},$$

since

$$(1 - p)^{n-x} = \frac{\left(1 - \dfrac mn\right)^n}{\left(1 - \dfrac mn\right)^x} \to e^{-m},$$

x being a finite quantity.

Hence the probabilities of 0, 1, 2, 3, ... successes are
given by

$$e^{-m},\ \frac{e^{-m} m}{1!},\ \frac{e^{-m} m^2}{2!},\ \frac{e^{-m} m^3}{3!},\ \ldots \text{ respectively.}$$

Note. The above result can also be obtained by applying
Stirling's theorem for n! as n → ∞.
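The passage to the limit can be observed numerically by holding np = m fixed and letting n grow. This is a sketch only; the function names and the cut-off `x_max` are assumptions made for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(n: int, p: float, x: int) -> float:
    """Term of (q + p)^n giving the probability of x successes."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def poisson_pmf(m: float, x: int) -> float:
    """Poisson probability e^(-m) m^x / x!."""
    return exp(-m) * m ** x / factorial(x)


def max_gap(n: int, m: float, x_max: int = 10) -> float:
    """Largest difference between the two probabilities for x = 0..x_max."""
    p = m / n  # keep np = m fixed
    return max(
        abs(binomial_pmf(n, p, x) - poisson_pmf(m, x)) for x in range(x_max + 1)
    )
```

With m = 2, the value of `max_gap` should shrink steadily as n is taken to be 10, 100 and 10,000.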

8.10. Mode of the Poisson Distribution. The value of x
which gives the greatest probability is the mode of the Poisson
distribution. Thus for the mode to be at x,

$$\frac{e^{-m} m^{x-1}}{(x - 1)!} \le \frac{e^{-m} m^x}{x!}
\qquad \text{and} \qquad
\frac{e^{-m} m^x}{x!} \ge \frac{e^{-m} m^{x+1}}{(x + 1)!},$$

giving m - 1 ≤ x ≤ m.

Thus if m is not an integer, the mode is the integral number
lying between m - 1 and m. If m is an integer, there are two
modes, one each at m - 1 and m, the probabilities in both cases
being equal.

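8.11. Constants of the Poisson Distribution.

(Agra M. Sc. '50)

$$\mu_1' = 0 \cdot e^{-m} + \frac{1 \cdot e^{-m} m}{1!} + \frac{2 \cdot e^{-m} m^2}{2!} + \ldots + \frac{r \cdot e^{-m} m^r}{r!} + \ldots$$

$$= m e^{-m} \left(1 + m + \frac{m^2}{2!} + \ldots\right) = m \cdot e^{-m} \cdot e^m = m.$$

Hence the mean is m.

$$\mu_2' = \sum_{r=0}^{\infty} r^2\, \frac{e^{-m} m^r}{r!}
= \sum_{r=0}^{\infty} \{r (r - 1) + r\}\, \frac{m^r e^{-m}}{r!}$$

$$= \sum_{r=2}^{\infty} \frac{m^2\, m^{r-2} e^{-m}}{(r - 2)!} + \sum_{r=1}^{\infty} \frac{m\, m^{r-1} e^{-m}}{(r - 1)!}$$

$$= e^{-m} \left[m^2 e^m + m e^m\right] = m^2 + m.$$

Hence the variance $\mu_2 = \mu_2' - \mu_1'^2 = m$,

giving S. D. $= \sqrt m$.

Similarly putting $r^3 = r (r - 1)(r - 2) + 3r (r - 1) + r$, we get

$$\mu_3' = m (m^2 + 3m + 1).$$

Also putting $r^4 = r (r - 1)(r - 2)(r - 3) + 6r (r - 1)(r - 2) + 7r (r - 1) + r$, we get

$$\mu_4' = (m^4 + 6m^3 + 7m^2 + m)\, e^{-m} \cdot e^m = m (m^3 + 6m^2 + 7m + 1).$$

Now

$$\mu_3 = \mu_3' - 3\mu_2' \mu_1' + 2\mu_1'^3 = m$$

and

$$\mu_4 = \mu_4' - 4\mu_3' \mu_1' + 6\mu_2' \mu_1'^2 - 3\mu_1'^4 = 3m^2 + m.$$

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{m^2}{m^3} = \frac1m, \qquad
\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{3m^2 + m}{m^2} = 3 + \frac1m,$$

$$\gamma_1 = \frac{1}{\sqrt m}, \qquad \gamma_2 = \beta_2 - 3 = \frac1m.$$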
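The constants of § 8.11 can be checked by summing the series numerically. The sketch below truncates the infinite sums after a fixed number of terms, which is an assumption that is harmless only when m is small compared with the number of terms; the function name `poisson_central_moments` is illustrative.

```python
from math import exp


def poisson_central_moments(m: float, terms: int = 200):
    """Return (mean, mu_2, mu_3, mu_4) from the truncated Poisson series."""
    pmf = []
    f = exp(-m)                 # probability of 0
    for r in range(terms):
        pmf.append(f)
        f *= m / (r + 1)        # e^(-m) m^(r+1) / (r+1)! from the previous term
    mean = sum(r * f for r, f in enumerate(pmf))

    def mu(k: int) -> float:
        # k-th moment about the mean
        return sum((r - mean) ** k * f for r, f in enumerate(pmf))

    return mean, mu(2), mu(3), mu(4)
```

For m = 2.5 the four values should agree with m, m, m and 3m² + m, i.e. 2.5, 2.5, 2.5 and 21.25.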


8.12. Solved Examples.

1. Show that for a Poisson distribution $\gamma_1 \gamma_2 \sigma M = 1$, where $\sigma$
and M are the S. D. and mean respectively.

(Agra M. Sc. '47, '56, '60, '63; Agra B. Sc. '56, '59)

Substituting the values from § 8.11,

$$\gamma_1 \gamma_2 \sigma M = \frac{1}{\sqrt m} \cdot \frac1m \cdot \sqrt m \cdot m = 1.$$

2. Show how the Poisson distribution

$$\frac{e^{-m} m^x}{x!} \qquad (x = 0, 1, 2, 3, \ldots)$$

can be regarded as the limiting form of the Binomial Distribution. Hence or otherwise, obtain the mean and the variance of the Poisson distribution assuming the variance of the Binomial distribution.

(I. A. S. ’55)

The Poisson distribution is obtained as the limiting form of the binomial distribution when n → ∞, where np = m so that p → 0.

The first four moments of the binomial distribution are:

$$\mu_1' = np,\quad \mu_2 = npq,\quad \mu_3 = npq(q-p),\quad \mu_4 = 3(npq)^2 + npq(1 - 6pq).$$

The moments of the Poisson distribution can be deduced as

$$\mu_1' = \operatorname{Lt}\,(np) = m,$$

$$\mu_2 = \operatorname{Lt}\,(npq) = \operatorname{Lt}\,\{np(1-p)\} = m, \text{ since } p \to 0,$$

$$\mu_3 = \operatorname{Lt}\,\{npq(q-p)\} = \operatorname{Lt}_{p\to 0}\,\{mq(q-p)\} = m, \text{ since } q \to 1 \text{ as } p \to 0,$$

$$\mu_4 = \operatorname{Lt}\,\{3(npq)^2 + npq(1-6pq)\} = \operatorname{Lt}\,\{3m^2q^2 + mq(1-6pq)\} = 3m^2 + m.$$

∴ mean = m and variance = μ₂ = m.

3. Bortkiewicz has given the following data of men killed by the kick of a horse in certain Prussian army corps in twenty years (1875-94). Calculate the theoretical frequencies taking the distribution to be Poisson.

Deaths        0     1     2     3     4
Frequency   109    65    22     3     1

(U. P. C. S. ’60)

Here the total number of deaths is

(65 × 1 + 22 × 2 + 3 × 3 + 1 × 4) = 122,

and hence the mean death per army corps per year is 122/200 = 0.61.
Hence m = 0.61 and N = 200. Also $e^{-0.61} = 0.5433$.

The theoretical frequencies for 0, 1, 2, 3, 4 deaths respectively are

$$200\,e^{-0.61},\quad \frac{200\,e^{-0.61}(0.61)}{1!},\quad \frac{200\,e^{-0.61}(0.61)^2}{2!},\quad \frac{200\,e^{-0.61}(0.61)^3}{3!},\quad \frac{200\,e^{-0.61}(0.61)^4}{4!},$$

i. e.

Deaths          0       1       2      3     4 and over
Frequency    108.7    66.3    20.2    4.1    0.7
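
Note. The fitting may be reproduced mechanically. The following sketch (the function name `poisson_expected` and the treatment of the last class as "4 and over" are assumptions made for illustration) estimates m from the observed table and then computes the theoretical frequencies.

```python
import math

def poisson_expected(total, m, last):
    """Expected frequencies for 0, 1, ..., last - 1 and for 'last and over'."""
    freqs = [total * math.exp(-m) * m**r / math.factorial(r)
             for r in range(last)]
    freqs.append(total - sum(freqs))   # the tail class takes the remainder
    return freqs

observed = [109, 65, 22, 3, 1]                       # deaths 0, 1, 2, 3, 4
n_obs = sum(observed)
mean = sum(r * f for r, f in enumerate(observed)) / n_obs
expected = poisson_expected(n_obs, mean, last=4)
print(mean, [round(e, 1) for e in expected])
```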


4. Six coins are tossed 6400 times. Using the Poisson distribution, what is the approximate probability of getting six heads x times? (I. A. S. ’55)

The probability of getting all the six heads in a throw of six coins is $\left(\tfrac{1}{2}\right)^6 = \tfrac{1}{2^6} = \tfrac{1}{64}$.

Hence the mean = np = 6400 × 1/64 = 100.

The probability of getting six heads x times according to the Poisson distribution is

$$\frac{e^{-100}(100)^x}{x!}.$$

5. In a certain factory turning out razor blades, there is a small chance 1/500 for any blade to be defective. The blades are supplied in packets of 10. Use the Poisson distribution to calculate the approximate number of packets containing no defective, one defective and two defective blades respectively in a consignment of 10,000 packets, given that $e^{-0.02} = 0.9802$.

(Agra M. Sc. ’55, ’59 ; Luck. B. Sc. ’47 ; Agra B. Sc. ’56 ; U. P. P. C. S. ’57)

p = 1/500, n = 10, N = 10,000,

i. e. m = np = 10/500 = 0.02,

$e^{-m} = e^{-0.02} = 0.9802$.

The frequencies of 0, 1, 2, 3, ... defective blades are given by the successive terms of $N e^{-m}\dfrac{m^r}{r!}$, i. e. $10{,}000\; e^{-0.02}\,\dfrac{(0.02)^r}{r!}$.

Number of packets with no defective blades $= 10{,}000 \times e^{-0.02} = 9802$.

Number of packets with one defective blade

$$= \frac{10{,}000 \times e^{-0.02} \times 0.02}{1!} = 196.$$

Number of packets with two defective blades

$$= \frac{10{,}000 \times e^{-0.02} \times (0.02)^2}{2!} = 1.96 = 2 \text{ nearly.}$$

6. A manufacturer of cotter pins knows that 5 per cent of his product is defective. If he sells pins in boxes of 100 and guarantees that not more than 10 pins will be defective, what is the approximate probability that a box will fail to meet the guaranteed quality?

p = 0.05, n = 100,

so that m = np = 5.

According to the Poisson distribution the probability of r defective pins is

$$\frac{e^{-5}\,5^r}{r!},\qquad r = 0, 1, 2, 3, \ldots$$

Hence the probability of 10 or less than ten defectives is

$$\sum_{r=0}^{10} \frac{e^{-5}\,5^r}{r!}.$$

The probability that the defectives are more than ten is

$$\sum_{r=11}^{100} \frac{e^{-5}\,5^r}{r!} \approx 1 - \sum_{r=0}^{10} \frac{e^{-5}\,5^r}{r!}.$$





7. A large number of observations on a given solution which contained bacteria were made, taking samples of 1 c. c. each and noting down the number of bacteria present in each sample. Assuming the Poisson distribution and given that 10 per cent samples contain no bacteria, find the average number of bacteria per c. c.

(Delhi M. A. Statistics ’59)

Let the average number of bacteria per c. c. = m, so that the successive probabilities of the Poisson distribution are

$$e^{-m},\quad \frac{e^{-m}\,m}{1!},\quad \frac{e^{-m}\,m^2}{2!},\quad \frac{e^{-m}\,m^3}{3!},\ \ldots$$

Hence the probability that the sample contains no bacteria $= e^{-m}$, so that

$e^{-m} = 0.1$

or $e^{m} = 10$

or $m = \log_e 10 = 2.3026.$

8. A Poisson distribution has a double mode at x = 4 and 5. Find the probability that x will have either of these values.

P (x = 4) = P (x = 5).

Hence $\dfrac{e^{-m} m^4}{4!} = \dfrac{e^{-m} m^5}{5!}$, giving m = 5.

Now $e^{-5} = 0.006738$.

Hence $P(x = 4) = \dfrac{0.006738 \times 5^4}{4!}$.

The required probability = P (x = 4) + P (x = 5)

$$= 2P(x = 4) = 2 \times \frac{0.006738 \times 5^4}{4!} = 0.351.$$

It is easy to show that P (4) > P (3) and that P (5) > P (6).


9. Red blood cell deficiency may be determined by examining 
a specimen of the blood under a microscope. Suppose a certain 
small fixed volume contains on the average 20 red cells for normal 
persons. Using Poisson distribution , obtain the probability that a 
specimen from a normal person will contain less than 15 red cells. 

(Agra B. Sc. ’61) 

Here m = 20.

Hence the probabilities of 0, 1, 2, ..., r, ... red cells are given by

$$e^{-20},\quad \frac{e^{-20}\cdot 20}{1!},\quad \frac{e^{-20}(20)^2}{2!},\ \ldots,\ \frac{e^{-20}(20)^r}{r!},\ \ldots \text{ respectively.}$$

The probability of less than 15 cells is given by

$$\sum_{r=0}^{14} \frac{e^{-20}(20)^r}{r!}.$$


10. If m is the parameter of a Poisson variate, show that the probabilities that the value of the variate taken at random is even or odd are $e^{-m}\cosh m$ and $e^{-m}\sinh m$ respectively. (Punjab M. A. ’53 S.)

The probability that the variate takes a value r is

$$\frac{e^{-m} m^r}{r!},\qquad r = 0, 1, 2, 3, \ldots, \infty.$$

The probability of an even value of the variate is

$$e^{-m}\left\{1 + \frac{m^2}{2!} + \frac{m^4}{4!} + \frac{m^6}{6!} + \cdots\right\} = e^{-m}\cosh m.$$

Similarly the probability of odd values is

$$e^{-m}\left\{m + \frac{m^3}{3!} + \frac{m^5}{5!} + \cdots\right\} = e^{-m}\sinh m.$$
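
Note. The two results can be confirmed by summing the even and odd terms separately. The routine below is a sketch; its name and the truncation at 200 terms are assumptions.

```python
import math

def even_odd_probabilities(m, terms=200):
    """Sum the Poisson probabilities of even and of odd values separately."""
    even = odd = 0.0
    p = math.exp(-m)              # P(0)
    for r in range(terms):
        if r % 2 == 0:
            even += p
        else:
            odd += p
        p *= m / (r + 1)          # P(r + 1) from P(r)
    return even, odd

m = 2.5
print(even_odd_probabilities(m))
print(math.exp(-m) * math.cosh(m), math.exp(-m) * math.sinh(m))
```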

11. Fit a Poisson distribution to the following data and calculate the theoretical frequencies:

Deaths        0     1     2     3     4
Frequency   122    60    15     2     1

(Agra M. Sc. ’49, ’54, ’57)

$$\text{Mean} = \frac{\sum f_i x_i}{\sum f_i} = \frac{60 + 30 + 6 + 4}{122 + 60 + 15 + 2 + 1} = 0.5.$$

Now $e^{-0.5} = 0.61$.

Hence the theoretical frequency, i.e. the number of r deaths, is given by $200 \times e^{-0.5}\,\dfrac{(0.5)^r}{r!}$, where r = 0, 1, 2, 3, 4.

The theoretical frequencies are 122, 61, 15, 2 and 0 for 0, 1, 2, 3, 4 deaths respectively.

12. Find the probability that at most 5 defective fuses will be found in a box of 200 fuses if experience shows that 2 per cent of such fuses are defective. (Agra M. Sc. ’61)

The probability of a fuse being defective is 1/50, which is small, and hence the distribution can be taken to be Poissonian.

p = 1/50, n = 200,

m = np = 4,

$e^{-m} = e^{-4} = 0.0183$.

The probability of 5 or less than 5 defective fuses is

$$\sum_{r=0}^{5} \frac{e^{-4}\,4^r}{r!} = \frac{e^{-4}}{0!} + \frac{e^{-4}\cdot 4}{1!} + \frac{e^{-4}(4)^2}{2!} + \frac{e^{-4}(4)^3}{3!} + \frac{e^{-4}(4)^4}{4!} + \frac{e^{-4}(4)^5}{5!}$$

$$= 0.0183\left\{1 + 4 + 8 + \frac{32}{3} + \frac{32}{3} + \frac{128}{15}\right\} = 0.785.$$

13. If X and Y are independently distributed as Poisson variates with parameters λ and μ respectively, find the probability distribution of X + Y.

(Agra M. Sc. ’62, B. Sc. ’63, Delhi B. A. Hons. ’61)

The variates take the values 0, 1, 2, 3, ... In order to find the probability of r successes, we require the probability in which the sum of X and Y is r. Since X and Y are independent variates, the probability that X takes the value s and Y the value r − s simultaneously is

$$\frac{\lambda^s e^{-\lambda}}{s!}\times\frac{\mu^{r-s} e^{-\mu}}{(r-s)!}.$$

Since s can take all integral values from 0 to r, the summation of the above probability when s varies from 0 to r gives the probability distribution of X + Y.

$$f(X + Y = r) = e^{-(\lambda+\mu)} \sum_{s=0}^{r} \frac{\lambda^s \mu^{r-s}}{s!\,(r-s)!}$$

$$= \frac{e^{-(\lambda+\mu)}}{r!} \sum_{s=0}^{r} \frac{r!}{s!\,(r-s)!}\,\lambda^s \mu^{r-s}$$

$$= \frac{e^{-(\lambda+\mu)}\,\mu^r}{r!}\left(1 + \frac{\lambda}{\mu}\right)^r = \frac{e^{-(\lambda+\mu)}(\mu+\lambda)^r}{r!}.$$

Hence we see that the probability distribution of X + Y is a Poisson distribution with mean λ + μ.
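
Note. The additive property can be checked numerically by carrying out the summation over s directly. A sketch under the assumption of independence; the function names and the values 1.2 and 2.3 are illustrative.

```python
import math

def poisson_pmf(r, m):
    return math.exp(-m) * m**r / math.factorial(r)

def sum_pmf(r, lam, mu):
    """P(X + Y = r) obtained by summing over s = 0, 1, ..., r."""
    return sum(poisson_pmf(s, lam) * poisson_pmf(r - s, mu)
               for s in range(r + 1))

lam, mu = 1.2, 2.3
for r in range(5):
    print(r, sum_pmf(r, lam, mu), poisson_pmf(r, lam + mu))
```

The two columns printed for each r should coincide.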

14. Letters were received in an office on each of 100 days. Assuming the following data to form a random sample from a Poisson distribution, fit the distribution and calculate the expected frequencies, taking $e^{-4} = 0.0183$.

Number of letters (x)   0    1    2    3    4    5    6    7    8    9   10
Frequency (f)           1    4   15   22   21   20    8    6    2    0    1

(Agra M. Sc. ’48)

Here the mean $= \dfrac{\sum f x}{\sum f} = \dfrac{400}{100} = 4$.

The expected frequencies are given by

$$\frac{100 \times e^{-4} \times (4)^r}{r!},\quad \text{where } r = 0, 1, 2, 3, \ldots$$

On calculation, the frequencies come out to be

1.83, 7.32, 14.64, 19.52, 19.52, 15.62, 10.41, 5.95, 2.975, 1.322 and 0.529.


15. In 1000 extensive sets of trials for an event of small probability, the frequencies f of the number x of successes proved to be

x     0     1     2    3    4   5   6   7
f   305   365   210   80   28   9   2   1

Find the theoretical frequencies and verify that the variance of the given distribution is 1.28. (U. P. P. C. S. ’58)

N = Σ f = 1000,

$$\text{Mean} = \frac{\sum f_i x_i}{\sum f_i} = \frac{(0 \times 305) + (1 \times 365) + (2 \times 210) + (3 \times 80) + (4 \times 28) + (5 \times 9) + (6 \times 2) + (7 \times 1)}{1000} = 1.2.$$

And variance = 1.279.

The theoretical frequency for $x_r$ is

$$\frac{1000\,e^{-1.2}(1.2)^r}{r!},$$

assuming a Poisson distribution.

Now $e^{-1.2} = 0.3012$.

The calculation of theoretical frequencies is shown below:

x_i     0       1       2      3      4     5     6     7
f_i   301.2   361.4   216.8   86.7   26.0   6.2   1.2   0.2

x     f
0     1000 e^{-1.2}                 = 301.2
1     1000 e^{-1.2} × 1.2           = 361.4
2     1000 e^{-1.2} × (1.2)^2 / 2!  = 216.8
3     1000 e^{-1.2} × (1.2)^3 / 3!  =  86.7
4     1000 e^{-1.2} × (1.2)^4 / 4!  =  26.0
5     1000 e^{-1.2} × (1.2)^5 / 5!  =   6.2
6     1000 e^{-1.2} × (1.2)^6 / 6!  =   1.2
7     1000 e^{-1.2} × (1.2)^7 / 7!  =   0.2
      Total                           999.7

16. Criticize the statement : 

The mean of a Poisson distribution is 5 while its standard 
deviation is 4. (M. Sc. Agra ’61) 

We have already seen that for a Poisson distribution,

σ = √m,

which is not satisfied by the above data, since with m = 5 the standard deviation should be √5 and not 4. Hence the statement is not correct.

17. A car hire firm has two cars which it hires out day by day. The number of demands for a car on each day is distributed as a Poisson distribution with mean 1.5. Calculate the proportion of days on which neither car is used and the proportion of days on which some demand is refused.

(U. P. C. S. ’53, B. Sc. Agra ’60, B. A. Hons. Cal. ’55)

m = 1.5,

$e^{-m} = 0.2231$.

The probabilities of 0, 1, 2, ... demands are given by

$$e^{-1.5},\quad \frac{e^{-1.5} \times 1.5}{1!},\quad \frac{e^{-1.5} \times (1.5)^2}{2!},\ \ldots \text{ respectively.}$$

Hence the proportion of days on which no car is used is 0.2231.

Some of the demands will be refused if the demand exceeds the number of cars available, i. e. two. Since the total probability is unity, the required probability can be obtained by subtracting from one the sum of the probabilities of the demand of 0, 1 and 2 cars.

$$\text{The required probability} = 1 - \sum_{r=0}^{2} \frac{e^{-1.5}(1.5)^r}{r!}$$

$$= 1 - e^{-1.5}\,\{1 + 1.5 + 1.125\} = 0.19126.$$
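
Note. The same two proportions may be computed without tables. The following is a sketch; the variable names are illustrative. It uses the exact value of $e^{-1.5}$, so the last figures differ slightly from those obtained with the rounded value 0.2231.

```python
import math

def poisson_pmf(r, m):
    return math.exp(-m) * m**r / math.factorial(r)

m, cars = 1.5, 2
p_idle = poisson_pmf(0, m)                                    # neither car used
p_refused = 1 - sum(poisson_pmf(r, m) for r in range(cars + 1))  # demand > 2
print(round(p_idle, 4), round(p_refused, 4))
```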

18. If x is a Poissonian variate with mean m, what would be the expectation of $e^{-kx}$, where k is a constant? Find also the expectation of $e^{-kx}\cdot kx$. (I. A. S. ’48)

The probabilities of the x variate distribution are given by the successive terms of $\dfrac{m^x e^{-m}}{x!}$, where x = 0, 1, 2, 3, ...

The expectation of $e^{-kx}$ is

$$E(e^{-kx}) = \sum_{x=0}^{\infty} e^{-kx}\,\frac{m^x e^{-m}}{x!} = e^{-m}\sum_{x=0}^{\infty} \frac{m^x e^{-kx}}{x!} = e^{-m}\sum_{x=0}^{\infty} \frac{(m e^{-k})^x}{x!}$$

$$= e^{-m}\exp(m e^{-k}) = e^{-m(1 - e^{-k})}.$$

Also

$$E(kx\,e^{-kx}) = \sum_{x=0}^{\infty} kx\,e^{-kx}\,\frac{m^x e^{-m}}{x!} = k e^{-m}\sum_{x=1}^{\infty} \frac{e^{-kx} m^x}{(x-1)!}$$

$$= k e^{-m}\left\{m e^{-k} + \frac{m^2 e^{-2k}}{1!} + \frac{m^3 e^{-3k}}{2!} + \cdots\right\}$$

$$= k e^{-m}\cdot m e^{-k}\exp(m e^{-k}) = mk\,\exp\{m e^{-k} - k - m\}.$$
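
Note. Both expectations can be compared with their closed forms by summing the series. A minimal sketch; the helper `expect` and the trial values m = 2, k = 0.7 are assumptions for illustration.

```python
import math

def expect(g, m, terms=200):
    """E[g(x)] for a Poisson variate x with mean m, by direct summation."""
    total = 0.0
    p = math.exp(-m)              # P(0)
    for x in range(terms):
        total += g(x) * p
        p *= m / (x + 1)          # P(x + 1) from P(x)
    return total

m, k = 2.0, 0.7
lhs1 = expect(lambda x: math.exp(-k * x), m)
rhs1 = math.exp(-m * (1 - math.exp(-k)))
lhs2 = expect(lambda x: k * x * math.exp(-k * x), m)
rhs2 = m * k * math.exp(m * math.exp(-k) - k - m)
print(lhs1, rhs1)
print(lhs2, rhs2)
```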

19. Two dice are thrown until a seven is obtained. Find the 
most probable number of throws and the expected number of throws. 


(M. A. Punjab ’56) 

The probability of getting a seven with two dice is 1/6, since seven can be thrown in (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1), i. e. six ways, and the total number of ways in which two dice can fall is 6 × 6 = 36.

Hence p = 1/6.

The number of attempts and the corresponding probabilities for throwing seven are as under:

Attempts       1     2     3      4      5 ...
Probability    p     qp    q²p    q³p    q⁴p ...

Expected number of throws

$$= (p \times 1) + (qp \times 2) + (q^2 p \times 3) + \cdots \text{ to inf.}$$

$$= p\,(1 + 2q + 3q^2 + \cdots) = p\,(1-q)^{-2} = \frac{1}{p}, \text{ since } 1 - q = p,$$

= 6.

Since q < 1, the probabilities p, qp, q²p, ... decrease steadily, and hence the most probable number of throws

= 1.
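
Note. A direct computation confirms both answers. This is a sketch only; the series is truncated at 400 throws, which is an assumption, the neglected tail being negligibly small.

```python
p = 6 / 36                      # chance of a seven with two dice
q = 1 - p
limit = 400                     # truncation point of the infinite series

# P(first seven on throw k) = q**(k - 1) * p for k = 1, 2, 3, ...
probs = [q ** (k - 1) * p for k in range(1, limit + 1)]
expected = sum(k * pr for k, pr in enumerate(probs, start=1))
most_probable = max(range(1, limit + 1), key=lambda k: probs[k - 1])
print(expected, most_probable)
```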

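20. Let λ and $\mu_r$ denote the mean and rth moment about the mean of a Poisson distribution respectively. Obtain the following recurrence formula:

$$\mu_{r+1} = r\lambda\,\mu_{r-1} + \lambda\,\frac{d\mu_r}{d\lambda}.$$

(Punjab M. A. ’56, Delhi M. A. ’56, ’60, Patna M. A. ’56)

Now

$$\mu_r = \sum_{j=0}^{\infty} \frac{e^{-\lambda}\lambda^j}{j!}\,(j-\lambda)^r,$$

$$\therefore\ \frac{d\mu_r}{d\lambda} = -r\sum_{j} \frac{e^{-\lambda}\lambda^j}{j!}\,(j-\lambda)^{r-1} + \sum_{j} \frac{1}{j!}\left\{j\lambda^{j-1}e^{-\lambda} - \lambda^j e^{-\lambda}\right\}(j-\lambda)^r$$

or

$$\lambda\,\frac{d\mu_r}{d\lambda} = \sum_{j} \frac{e^{-\lambda}\lambda^j}{j!}\,(j-\lambda)^{r+1} - r\lambda\sum_{j} \frac{e^{-\lambda}\lambda^j}{j!}\,(j-\lambda)^{r-1}$$

$$= \mu_{r+1} - r\lambda\,\mu_{r-1}$$

or

$$\mu_{r+1} = \lambda\,\frac{d\mu_r}{d\lambda} + r\lambda\,\mu_{r-1}.$$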


21. Comment upon the following statement:

“Poisson Distribution is of such frequent occurrence that it is not proper to consider it as only a limiting case of binomial distribution.” (I. A. S. ’61)

The statement is by and large true since the Poisson distribution has been found applicable in a very large number of cases, the most important being Queueing Theory. It can be proved that if the mean rate of arrivals at a certain service is λ and the arrivals are random, the probability of r arrivals in time t is given by

$$\frac{e^{-\lambda t}(\lambda t)^r}{r!},$$

which represents a Poisson distribution. The Queueing Theory has been developed into a full subject and the above result has been derived without any reference to the binomial distribution.

In the binomial distribution we have to know not only the number of times an event has occurred but also the number of times it could have occurred but did not. In many cases the latter number has no meaning, as with the mistakes in the proofs of a printed book, since we have nothing to do with the number of times a mistake could have occurred but did not.


The Poisson distribution has found applications in many biological occurrences. The number of radioactive particles emitted over a specified number of intervals is distributed in Poissonian manner. In fact, in cases where the possible number of successes is very large and may even tend to infinity (e. g. the number of arrivals of customers, the number of deaths by a disease, though rare, the mistakes per page in a printed book), the binomial distribution fails. It is not uncommon to come across such cases in daily life, and hence the Poisson distribution has been found to be of wide application.

8.13. Multinomial Distribution. Suppose a single trial of an event results in one and only one of the k possible outcomes $x_1, x_2, \ldots, x_k$ with respective probabilities $p_1, p_2, \ldots, p_k$, and also n independent trials of the event be performed. The probability that out of these n trials, $x_1$ occurs $n_1$ times, $x_2$ occurs $n_2$ times and so on, so that $\sum_{i=1}^{k} n_i = n$, is

$$\frac{n!}{(n_1!)(n_2!)(n_3!)\cdots(n_k!)}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}.$$

Proof. Out of the different possible orders in which $x_1$ can occur $n_1$ times, $x_2$ occurs $n_2$ times, ..., let us consider the particular order

$$\underbrace{x_1, x_1, \ldots, x_1}_{n_1 \text{ times}},\ \underbrace{x_2, x_2, \ldots, x_2}_{n_2 \text{ times}},\ \ldots,\ \underbrace{x_k, x_k, \ldots, x_k}_{n_k \text{ times}}.$$

Obviously the probability of the above sequence of happenings is $p_1^{n_1} p_2^{n_2} p_3^{n_3} \cdots p_k^{n_k}$. Also the number of different possible orders of this occurrence is equal to the number of ways in which n things can be arranged, $n_1$ being alike, $n_2$ being alike, ..., which is equal to

$$\frac{n!}{(n_1!)(n_2!)\cdots(n_k!)} \qquad \ldots(2)$$

and hence the required probability is

$$\frac{n!}{(n_1!)(n_2!)\cdots(n_k!)}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}. \qquad \ldots(3)$$

This is known as the multinomial distribution since (3) is the general term of the multinomial expansion

$$(p_1 + p_2 + \cdots + p_k)^n$$

and (2) is the coefficient of $p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$ in the above expansion. It can be seen that the multinomial distribution is the generalised form of the binomial distribution.
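
Note. The general term (3) translates directly into a short routine. The sketch below is illustrative; the name `multinomial_pmf` is an assumption. The usage lines check that with k = 2 the binomial term is recovered.

```python
import math

def multinomial_pmf(counts, probs):
    """n! / (n_1! n_2! ... n_k!) * p_1**n_1 * ... * p_k**n_k"""
    n = sum(counts)
    coeff = math.factorial(n)
    for c in counts:
        coeff //= math.factorial(c)      # each division is exact
    value = float(coeff)
    for c, p in zip(counts, probs):
        value *= p ** c
    return value

# With two outcomes the multinomial term reduces to the binomial term.
print(multinomial_pmf([2, 3], [0.3, 0.7]))
print(math.comb(5, 2) * 0.3**2 * 0.7**3)
```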


8.14. Hypergeometric Distribution. Suppose an urn contains N balls of which Np are black and Nq white so that p + q = 1. The probability that if n balls are drawn (without replacement), exactly x of them will be black is given by

$$f(x) = \frac{{}^{Np}C_x\ {}^{Nq}C_{n-x}}{{}^{N}C_n},$$

since out of Np black balls x can be drawn in ${}^{Np}C_x$ ways and out of Nq white balls n − x can be drawn in ${}^{Nq}C_{n-x}$ ways. The conditions are that 0 ≤ x ≤ Np; 0 ≤ n − x ≤ Nq. It can be easily seen that f (x) represents a probability density function known as the Hypergeometric distribution. For, we have

$$\sum_{x=0}^{Np} f(x) = 1,$$

since

$$\sum_{x=0}^{Np} {}^{Np}C_x\ {}^{Nq}C_{n-x}\,\big/\,{}^{N}C_n = 1$$

as

$$\sum_{x=0}^{Np} {}^{Np}C_x\ {}^{Nq}C_{n-x} = {}^{Np+Nq}C_n = {}^{N}C_n,$$

which can be proved by equating the coefficients of $x^n$ on both sides in the product

$$(1+x)^{Np}(1+x)^{Nq} = \{1 + {}^{Np}C_1 x + {}^{Np}C_2 x^2 + \cdots\} \times \{1 + {}^{Nq}C_1 x + {}^{Nq}C_2 x^2 + \cdots\},$$

giving the above result.

Hence f (x) represents a probability density function.

Now let the drawings be by replacement after each draw of a ball. In such a case the proportions of the balls remain unaffected by the draws. The same result can be obtained by making N → ∞. Thus we get the limiting form of the hypergeometric distribution.

$$\operatorname{Lt}_{N\to\infty} \frac{{}^{Np}C_x\ {}^{Nq}C_{n-x}}{{}^{N}C_n} = \operatorname{Lt}_{N\to\infty} \frac{(Np)!\,(Nq)!}{x!\,(Np-x)!\,(n-x)!\,(Nq-n+x)!}\cdot\frac{n!\,(N-n)!}{N!}$$

$$= \operatorname{Lt}_{N\to\infty} \frac{n!}{x!\,(n-x)!}\cdot\frac{\{Np(Np-1)(Np-2)\cdots(Np-x+1)\}\times\{Nq(Nq-1)(Nq-2)\cdots(Nq-n+x+1)\}}{N(N-1)(N-2)\cdots(N-n+1)}$$

$$= \operatorname{Lt}_{N\to\infty} {}^{n}C_x\,\frac{\left\{p\left(p-\frac{1}{N}\right)\cdots\left(p-\frac{x-1}{N}\right)\right\}\times\left\{q\left(q-\frac{1}{N}\right)\cdots\left(q-\frac{n-x-1}{N}\right)\right\}}{\left(1-\frac{1}{N}\right)\left(1-\frac{2}{N}\right)\cdots\left(1-\frac{n-1}{N}\right)}$$

$$= {}^{n}C_x\, p^x q^{n-x},$$

which is the binomial frequency distribution. Hence if the drawings are by replacement, the distribution is binomial, which was to be expected, since in this case p and q, which represent the probabilities of the draws of black and white balls respectively in a single trial, remain unchanged.

The generalised form of the hypergeometric distribution is that if there are balls of r colours in proportions given by $p_1, p_2, \ldots, p_r$, so that $p_1 + p_2 + \cdots + p_r = 1$, the probability of drawing $x_1$ balls of the first colour, $x_2$ of the second and so on without replacement is given by

$$f(x_1, x_2, \ldots, x_r) = \frac{{}^{Np_1}C_{x_1}\ {}^{Np_2}C_{x_2} \cdots {}^{Np_r}C_{x_r}}{{}^{N}C_n},$$

where $x_1 + x_2 + \cdots + x_r = n$ and $x_1 \le Np_1,\ x_2 \le Np_2,\ \ldots,\ x_r \le Np_r$.

If the drawings are with replacement, the distribution is multinomial, given by

$$f(x_1, x_2, \ldots, x_r) = \frac{n!}{x_1!\,x_2! \cdots x_r!}\; p_1^{x_1} p_2^{x_2} \cdots p_r^{x_r}.$$

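Mean and variance of the Hypergeometric Distribution.

$$\mu_1' = \sum_{x=0}^{n} \frac{x\ {}^{Np}C_x\ {}^{Nq}C_{n-x}}{{}^{N}C_n} = \sum_{x=1}^{n} \frac{x\ {}^{Np}C_x\ {}^{Nq}C_{n-x}}{{}^{N}C_n}$$

$$= Np \sum_{x=1}^{n} \frac{{}^{Np-1}C_{x-1}\ {}^{Nq}C_{n-x}}{{}^{N}C_n} = Np\,\frac{{}^{Np+Nq-1}C_{n-1}}{{}^{N}C_n}$$

$$= Np\,\frac{{}^{N-1}C_{n-1}}{{}^{N}C_n}, \text{ since } p + q = 1,$$

$$= np.$$

Now

$$\mu_2' = \sum_{x=0}^{n} x^2 f(x) = \sum_{x=0}^{n} \{x(x-1) + x\}\,f(x) = \sum_{x=0}^{n} x(x-1)\,f(x) + np.$$

$$\sum_{x=0}^{n} x(x-1)\,f(x) = \sum_{x=2}^{n} \frac{x(x-1)\ {}^{Np}C_x\ {}^{Nq}C_{n-x}}{{}^{N}C_n} = Np(Np-1)\sum_{x=2}^{n} \frac{{}^{Np-2}C_{x-2}\ {}^{Nq}C_{n-x}}{{}^{N}C_n}$$

$$= Np(Np-1)\,\frac{{}^{Np+Nq-2}C_{n-2}}{{}^{N}C_n} = Np(Np-1)\,\frac{{}^{N-2}C_{n-2}}{{}^{N}C_n} = \frac{np\,(n-1)(Np-1)}{N-1}.$$

Hence

$$\mu_2' = \frac{np\,(n-1)(Np-1)}{N-1} + np = np\left(\frac{nNp - n - Np + N}{N-1}\right).$$

$$\mu_2 = \mu_2' - \mu_1'^2 = np\left(\frac{nNp - n - Np + N}{N-1}\right) - n^2p^2 = \frac{n(N-n)\,pq}{N-1}.$$

If N → ∞, we get the variance of the binomial distribution on taking limits as npq.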

Example. An urn contains 7 black and 3 white balls. If 5 balls are withdrawn, find the frequency function for the number of black balls obtained: (a) if drawings are made with replacement, (b) if drawings are made without replacement.

If drawings are made with replacement, the probability that x balls drawn are black is given by

$$f(x) = \frac{5!}{x!\,(5-x)!}\,(0.7)^x (0.3)^{5-x}.$$

If drawings are without replacement, the density function is hypergeometric and the probability of x black balls drawn is

$$\frac{{}^{7}C_x\ {}^{3}C_{5-x}}{{}^{10}C_5}.$$
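
Note. For the urn of this example the hypergeometric frequencies, together with the mean np and the variance n(N − n)pq/(N − 1), can be verified in exact fractions. A sketch; the function name `hypergeom_pmf` and the variable K for the number of black balls are assumptions.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, N, K, n):
    """P(x black) when n balls are drawn without replacement
    from N balls of which K are black."""
    return Fraction(comb(K, x) * comb(N - K, n - x), comb(N, n))

N, K, n = 10, 7, 5
p = Fraction(K, N)
q = 1 - p
pmf = {x: hypergeom_pmf(x, N, K, n) for x in range(n + 1)}
mean = sum(x * f for x, f in pmf.items())
variance = sum(x * x * f for x, f in pmf.items()) - mean ** 2
print(pmf)
print(mean, variance)
```

Here `comb` returns zero whenever more white balls are asked for than the urn contains, so impossible values of x receive zero probability automatically.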

8.15. Normal Distribution.

Introduction. The normal distribution, also known as the error function, was introduced in 1733 by the celebrated mathematician De Moivre as the limiting form of the binomial distribution. Later Laplace and Gauss derived it independently of each other as the distribution of errors in physical measurements. The normal distribution has got wide applications in the theory of statistics.

8.16. Derive the normal distribution as the limiting form of the binomial distribution $(p+q)^n$ as n → ∞, neither p nor q being very small. (Agra M. Sc. ’48, ’52, ’54, ’55, B. Sc. ’56, ’60, ’63)

In the binomial distribution, the probability density function for r successes is given by

$$f(r) = {}^{n}C_r\, p^r q^{n-r},$$

$$f(r+1) = {}^{n}C_{r+1}\, p^{r+1} q^{n-r-1}.$$

$$\therefore\ \frac{f(r+1)}{f(r)} = \frac{{}^{n}C_{r+1}}{{}^{n}C_r}\cdot\frac{p}{q} = \frac{n-r}{r+1}\cdot\frac{p}{q},$$

so that f (r + 1) > f (r) so long as

(n − r) p > (r + 1) q

or r (p + q) < np − q

or r < np − q.

Let us assume that np is a whole number, the assumption being justified since n → ∞. In that case f (np) shall be maximum.

Now

$$f(np) = \frac{n!}{(np)!\,(nq)!}\; p^{np} q^{nq}$$

and

$$f(np + x) = \frac{n!}{(np+x)!\,(nq-x)!}\; p^{np+x} q^{nq-x}.$$

$$\therefore\ \frac{f(np+x)}{f(np)} = \frac{(np)!\,(nq)!}{(np+x)!\,(nq-x)!}\left(\frac{p}{q}\right)^x.$$

Using Stirling's approximation,

$$\frac{f(np+x)}{f(np)} = \frac{\sqrt{2\pi}\,e^{-np}(np)^{np+\frac12}\ \sqrt{2\pi}\,e^{-nq}(nq)^{nq+\frac12}}{\sqrt{2\pi}\,e^{-(np+x)}(np+x)^{np+x+\frac12}\ \sqrt{2\pi}\,e^{-(nq-x)}(nq-x)^{nq-x+\frac12}}\left(\frac{p}{q}\right)^x$$

$$= \frac{1}{\left(1+\frac{x}{np}\right)^{np+x+\frac12}\left(1-\frac{x}{nq}\right)^{nq-x+\frac12}}.$$


$$\therefore\ \log_e \frac{f(np+x)}{f(np)} = -\left(np+x+\tfrac12\right)\log_e\left(1+\frac{x}{np}\right) - \left(nq-x+\tfrac12\right)\log_e\left(1-\frac{x}{nq}\right)$$

$$= -\left(np+x+\tfrac12\right)\left(\frac{x}{np} - \frac{x^2}{2n^2p^2} + \frac{x^3}{3n^3p^3} - \cdots\right) + \left(nq-x+\tfrac12\right)\left(\frac{x}{nq} + \frac{x^2}{2n^2q^2} + \frac{x^3}{3n^3q^3} + \cdots\right)$$

$$= -\frac{x}{2n}\left(\frac{1}{p}-\frac{1}{q}\right) - \frac{x^2}{2npq} + \frac{x^2}{4n^2}\left(\frac{1}{p^2}+\frac{1}{q^2}\right) - \frac{x^3}{6n^2}\left(\frac{1}{q^2}-\frac{1}{p^2}\right) + \text{terms with higher powers of } \frac{x}{n}$$

$$= -\frac{x^2}{2npq} + \frac{x^2(p^2+q^2)}{4n^2p^2q^2} - \frac{(q-p)}{2npq}\left(x - \frac{x^3}{3npq}\right) + \cdots,$$

neglecting the terms of higher order in 1/n.

If n is large, the second term on the right hand side is very small in comparison with the first term; moreover, if neither p nor q is very small, we have $\dfrac{q-p}{\sqrt{npq}}$ a small quantity, so that

$$\log_e \frac{f(np+x)}{f(np)} = -\frac{x^2}{2npq} = -\frac{x^2}{2\sigma^2},$$

where $npq = \sigma^2$, the variance of the binomial distribution.

$$\therefore\ f(np+x) = f(np)\, e^{-x^2/2\sigma^2}.$$

Denoting f (np) by $y_0$ and f (np + x) by $y_x$, we have the equation of the normal curve as

$$y = y_0\, e^{-x^2/2\sigma^2},$$

where the origin has been taken at np successes in place of zero success and x is the deviation from np.

If the origin is taken at 0 successes, the equation of the curve shall be

$$y_x = y_0\, e^{-(x-np)^2/2\sigma^2}.$$
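
Note. The closeness of the approximation f (np + x) = f (np) e^{−x²/2σ²} can be examined numerically. The following is a sketch only: the function names are illustrative, and logarithms of factorials are taken through `math.lgamma` so that large n does not cause overflow.

```python
import math

def log_binom_pmf(r, n, p):
    """Natural logarithm of nCr * p**r * q**(n - r)."""
    return (math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
            + r * math.log(p) + (n - r) * math.log(1 - p))

def ratio_to_peak(x, n, p):
    """f(np + x) / f(np), with np assumed to be a whole number."""
    centre = round(n * p)
    return math.exp(log_binom_pmf(centre + x, n, p)
                    - log_binom_pmf(centre, n, p))

n, p = 10_000, 0.5
sigma2 = n * p * (1 - p)          # npq, the variance of the binomial
for x in (10, 50, 100):
    print(x, ratio_to_peak(x, n, p), math.exp(-x * x / (2 * sigma2)))
```

For each x the exact ratio and the normal ordinate ratio printed side by side should agree closely.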

8.17. Properties of Normal Distribution.

(Agra B. Sc. ’56, ’59)

The general equation of the normal distribution is

$$y = y_0\, e^{-\frac12\left(\frac{x-m}{\sigma}\right)^2},$$

where m is the mean. The curve is symmetrical about x = m and lies wholly above the x-axis. The frequency tapers off to zero on both sides of the mean. The median and mode coincide with the mean at x = m, and y decreases rapidly as |x − m| increases.

In

$$y = y_0\, e^{-x^2/2\sigma^2},$$

the origin being taken at the mean, the value of $y_0$ is calculated by making the total area under the curve to be unity. This gives

$$1 = y_0 \int_{-\infty}^{\infty} e^{-x^2/2\sigma^2}\,dx = 2y_0 \int_{0}^{\infty} e^{-x^2/2\sigma^2}\,dx = y_0\,\sigma\sqrt{2\pi},$$

giving $y_0 = \dfrac{1}{\sigma\sqrt{2\pi}}$.

Thus $y = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}$ has unit area and is called the Normal Probability Curve.

If the total area is N, the equation becomes

$$y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}.$$

The points of inflexion of the normal curve, where $\dfrac{d^2y}{dx^2} = 0$ and $\dfrac{d^3y}{dx^3} \ne 0$, are given by x = ± σ when the origin is taken at the mean.





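8.18. Constants of the Normal Distribution.

Let us consider the equation

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-a)^2}{2\sigma^2}},$$

the total frequency being unity, the assumed mean being at zero.

$$\mu_1' = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x\, e^{-\frac{(x-a)^2}{2\sigma^2}}\,dx.$$

Putting $\dfrac{x-a}{\sqrt{2}\,\sigma} = t$ and $dx = \sqrt{2}\,\sigma\,dt$,

$$\mu_1' = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} (a + \sqrt{2}\,\sigma t)\, e^{-t^2}\,dt = 0 + \frac{2a}{\sqrt{\pi}} \int_{0}^{\infty} e^{-t^2}\,dt = \frac{2a}{\sqrt{\pi}}\cdot\frac{\sqrt{\pi}}{2} = a.$$

Now transferring the origin to (a, 0), the mean of the distribution, the equation becomes

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}.$$

Hence

$$\mu_n = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x^n e^{-x^2/2\sigma^2}\,dx.$$

The value of the integral is zero when n is odd and hence all odd moments about the mean vanish, since $x^n e^{-x^2/2\sigma^2}$ is an odd function of x.

If n is even,

$$\mu_n = \frac{2}{\sigma\sqrt{2\pi}} \int_{0}^{\infty} x^n e^{-x^2/2\sigma^2}\,dx.$$

Put $\dfrac{x^2}{2\sigma^2} = u$, or $\dfrac{x\,dx}{\sigma^2} = du$:

$$\mu_n = \frac{\sigma^n\, 2^{n/2}}{\sqrt{\pi}} \int_{0}^{\infty} e^{-u} u^{(n-1)/2}\,du = \frac{\sigma^n\, 2^{n/2}}{\sqrt{\pi}}\,\Gamma\!\left(\frac{n+1}{2}\right).$$

As a particular case, we have

$$\mu_2 = \frac{\sigma^2\cdot 2}{\sqrt{\pi}}\,\Gamma\!\left(\tfrac{3}{2}\right) = \sigma^2, \text{ i. e. the standard deviation is } \sigma.$$

Similarly $\mu_4 = 3\sigma^4$.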


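Hence $\beta_1 = 0$ so that $\gamma_1 = 0$,

and $\beta_2 = 3$ so that $\gamma_2 = 0$.

Thus the normal curve has zero kurtosis.

The mean deviation about the mean

$$= \frac{2}{\sigma\sqrt{2\pi}} \int_{0}^{\infty} x\, e^{-x^2/2\sigma^2}\,dx = \sqrt{\frac{2}{\pi}}\;\sigma = \tfrac{4}{5}\,\sigma \text{ nearly.}$$

A relation between the successive moments of even order can be found as follows:

$$\mu_{2r} = \frac{2}{\sigma\sqrt{2\pi}} \int_{0}^{\infty} x^{2r} e^{-x^2/2\sigma^2}\,dx.$$

Integrating by parts, we get

$$\mu_{2r} = \frac{2}{\sigma\sqrt{2\pi}}\left[-\sigma^2 x^{2r-1} e^{-x^2/2\sigma^2}\right]_0^{\infty} + \frac{2(2r-1)\,\sigma^2}{\sigma\sqrt{2\pi}} \int_{0}^{\infty} x^{2r-2} e^{-x^2/2\sigma^2}\,dx.$$

It can be proved that the first term within the brackets on the right-hand side vanishes when x → ∞.

Hence $\mu_{2r} = (2r-1)\,\sigma^2\,\mu_{2r-2}$.

Since $\mu_0 = 1$,

we have $\mu_2 = \sigma^2$,

$\mu_4 = 3\sigma^4$,

. . .

$\mu_{2r} = (2r-1)(2r-3)\cdots 3\cdot 1\cdot\sigma^{2r}\mu_0$
$= (2r-1)(2r-3)\cdots 3\cdot 1\cdot\sigma^{2r}$.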
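
Note. The Gamma-function expression for the even moments and the recurrence μ₂ᵣ = (2r − 1) σ² μ₂ᵣ₋₂ must give the same values. The sketch below compares the two; the function names and the trial value σ = 1.7 are assumptions for illustration.

```python
import math

def even_moment_gamma(n, sigma):
    """sigma**n * 2**(n/2) * Gamma((n + 1)/2) / sqrt(pi), n even."""
    return sigma**n * 2 ** (n / 2) * math.gamma((n + 1) / 2) / math.sqrt(math.pi)

def even_moment_recurrence(n, sigma):
    """Build mu_n from mu_0 = 1 using mu_2r = (2r - 1) sigma**2 mu_(2r-2)."""
    mu = 1.0
    for k in range(2, n + 1, 2):
        mu *= (k - 1) * sigma**2
    return mu

for n in (2, 4, 6, 8):
    print(n, even_moment_gamma(n, 1.7), even_moment_recurrence(n, 1.7))
```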

8.19. Some Further Properties of the Normal Distribution. If in the probability differential

$$df = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-m)^2}{2\sigma^2}}\,dx$$

we substitute $\dfrac{x-m}{\sigma} = t$, we get

$$df = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\,dt,$$

which is known as the standard form of the normal distribution. It can be seen that t is normally distributed with 0 mean and unit standard deviation.

For the standard normal curve

$$\phi(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2},$$

tables have been prepared which give (i) the values of the ordinates for values of t, the ordinate at the mean, where t = 0, being $\dfrac{1}{\sqrt{2\pi}}$; (ii) the areas of the curve lying to the left of the values of t, i. e. between t = −∞ to t, or the areas of the curve between the values t = 0 to t. We have

$$\int_{-\infty}^{t} \phi(t)\,dt = \int_{-\infty}^{0} \phi(t)\,dt + \int_{0}^{t} \phi(t)\,dt = 0.5 + \int_{0}^{t} \phi(t)\,dt,$$

since the total area under the curve extending from t = −∞ to ∞ is unity.

From the tables it may be observed that the area lying outside t = ± 3 is 0.0027, i. e. the area between t = ± 3 is 99.73% of the total area of the curve. It means that the probability that a value of x lies outside M ± 3σ for a normal variate is 0.0027, where M is the mean and σ the S. D. Similarly the area within t = ± 2 is 95.45% and within t = ± 1 is 68.27% nearly, as shown in the figure below.
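
Note. In place of tables, the area within t = ± k may be obtained from the error function, since the area between −t and t under the standard normal curve equals erf (t/√2). A short sketch; the name `area_within` is an assumption.

```python
import math

def area_within(t):
    """Area under the standard normal curve between -t and +t."""
    return math.erf(t / math.sqrt(2))

for t in (1, 2, 3):
    print(t, round(area_within(t), 4))
```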



8.20. Probable Error.

The quartiles $Q_1$ and $Q_3$ are equidistant from the mean. For $Q_3$ we have

$$\int_{-\infty}^{t} \phi(t)\,dt = 0.75,$$

i. e. the value of t for which $\int_{0}^{t} \phi(t)\,dt = 0.25$. From the tables this is t = 0.6745.

Hence

$Q_3 = m + 0.6745\,\sigma$,

$Q_1 = m - 0.6745\,\sigma$.

The semi-interquartile range for a normal distribution will be denoted by E.

E = 0.6745 σ.

This E is known as the Probable Error. For the normal distribution m ± E is the range within which 50% of the variates lie, i.e. the area of the normal curve between m − E and m + E will be half of the total area under the curve, or, in other words, the probability is one half that a variate selected at random will have a value between m − E and m + E.
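
Note. The constant 0.6745 can be recovered from the inverse of the standard normal distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library; the trial values m = 50 and σ = 8 are assumptions chosen only to illustrate the 50 per cent property.

```python
from statistics import NormalDist

# Upper quartile of the standard normal curve.
t_q = NormalDist().inv_cdf(0.75)

m, sigma = 50.0, 8.0
E = t_q * sigma                               # probable error
dist = NormalDist(m, sigma)
inside = dist.cdf(m + E) - dist.cdf(m - E)    # area between m - E and m + E
print(round(t_q, 4), inside)
```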

8.21. Importance of the Normal Distribution. At one time it was thought that a large number of continuous distributions follow the normal law. In fact the tendency grew up to make them approximate the normal distribution. However, it is now realized that although many random variables encountered in practice appear to be approximately normal, the distribution cannot be applied very generally. In fact a contradiction in the distribution is that while measuring heights, according to the normal law, there is always a finite probability that the height is negative, since the variable extends from −∞ to ∞. It may be remembered, however, that in the normal distribution the probability of the variable lying outside a range of three times the standard deviation from the mean is very small. The use of the normal distribution has been found to be very helpful in the study of the theory of large samples, since if a sample is large enough, other sample statistics (variance, median etc.) tend to be nearly normally distributed. The mathematical properties of the distribution also have made it fascinating for statisticians and mathematicians.

8.22. Solved Examples.

1. Comment on the following:

“Everybody believes in the law of errors (the normal curve), the experimenters because they think it is a mathematical theorem, the mathematicians because they think it is an experimental fact.”

(I. A. S. ’60)

The normal curve was first discovered by De Moivre as an approximation to the binomial theorem, and later Gauss and Laplace deduced it as the distribution of errors. Hence the normal curve, due to its mathematical properties (symmetry etc.), became very popular with mathematicians as well as experimenters, each having thought that it was discovered by the other. The above saying is due to Lippmann and shows the popularity of the distribution at that time.

2. If two normal universes have the same total frequency but the standard deviation of one is k times that of the other, show that the maximum frequency of the first is 1/k times that of the other.

(M. Sc. Agra ’51, ’59, B. Sc. Agra ’59, Lucknow ’46, I. A. S. ’47)

If the total frequency is N, the equation of the normal curve is

$$y = \frac{N}{\sigma_1\sqrt{2\pi}}\, e^{-(x-m_1)^2/2\sigma_1^2}. \qquad \ldots(1)$$

Similarly the equation of the curve representing the second distribution is

$$y = \frac{N}{\sigma_2\sqrt{2\pi}}\, e^{-(x-m_2)^2/2\sigma_2^2}. \qquad \ldots(2)$$

The maximum frequency, denoted by $y_0$, in the first case is given by

$$y_0 = \frac{N}{\sigma_1\sqrt{2\pi}}$$

and that in the second case by

$$y_0' = \frac{N}{\sigma_2\sqrt{2\pi}}.$$

Hence

$$\frac{y_0}{y_0'} = \frac{\sigma_2}{\sigma_1} = \frac{1}{k},$$

since $\sigma_1 = k\sigma_2$ (given).

3. In a normal distribution whose mean is 2 and S. D. 3, find a value of the variate such that the probability of the interval from the mean to the value is 0.4115. Find another value such that the probability for the interval from x = 3.5 to that value is 0.2307.

m = 2, σ = 3.

$$\therefore\ t = \frac{x-m}{\sigma} = \frac{x-2}{3}.$$

From the tables of the areas of the standard normal curve the value of t for which

$$\int_{0}^{t} \phi(t)\,dt = 0.4115$$

is found to be 1.35.

Hence $\dfrac{x-2}{3} = 1.35$,

giving x = 6.05.

For x = 3.5, we have

$$t = \frac{3.5 - 2}{3} = 0.5.$$

Hence the area under the standard normal curve between t = 0.5 to $t_1$ is 0.2307. The area between t = 0 to t = 0.5 from the tables is found to be 0.19146. Hence the area between t = 0 to $t = t_1$ is

0.19146 + 0.2307 = 0.42216.

From the tables the value of $t_1$ = 1.42.

Hence $\dfrac{x-2}{3} = 1.42$,

giving x = 6.26.

4. A factory turns out an article by mass production methods. 
From past experience it appears that 20 articles on an average are 
rejected out of every batch of 100 . Find the variance of the 
number of rejects in a batch. What is the probability that the 
number of rejects in a batch exceeds 30 ? (Bombay ’46) 

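The probability of an article being rejected $= \dfrac{20}{100} = \dfrac{1}{5}$,

i. e. $p = \tfrac{1}{5}$, $q = 1 - \tfrac{1}{5} = \tfrac{4}{5}$.

The mean $m = np = 100 \times \tfrac{1}{5} = 20$.

Variance $\sigma^2 = npq = 100 \times \tfrac{1}{5} \times \tfrac{4}{5} = 16$.

For x = 30, $t = \dfrac{30 - 20}{4} = 2.5$.

From the tables, the area of the standard normal curve

$$\int_{0}^{2.5} \phi(t)\,dt = 0.49379.$$

Hence

$$\int_{2.5}^{\infty} \phi(t)\,dt = 0.5 - 0.49379 = 0.00621.$$

Hence the required probability = 0.00621.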
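
Note. The tail area may be computed directly instead of being read from tables. A sketch using `statistics.NormalDist` from the Python standard library; as in the solution above, the binomial number of rejects is replaced by a normal variate with the same mean and variance, and the variable names are illustrative.

```python
from statistics import NormalDist

n, p = 100, 20 / 100
mean = n * p                           # np
sd = (n * p * (1 - p)) ** 0.5          # square root of npq
p_exceed = 1 - NormalDist(mean, sd).cdf(30)
print(round(p_exceed, 5))
```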

5. If $\log_{10} x$ is normally distributed with mean 4 and variance 4, find the probability of 1.202 < x < 83,180,000, given

$\log_{10} 1202 = 3.08$, $\log_{10} 8318 = 3.92$.

(M. A. Punjab ’58, B. A. Hons. Delhi ’59)

We have,

$\log_{10} 1.202 = 0.08$ and $\log_{10} 83180000 = 7.92$.

Now, writing $t = \dfrac{\log_{10} x - m}{\sigma}$ with m = 4 and σ = 2,

$$t_1 = \frac{0.08 - 4}{2} = -1.96,$$

$$t_2 = \frac{7.92 - 4}{2} = 1.96.$$

From the tables, the area of the normal probability curve between t = ± 1.96 is 0.95, which is the required probability.

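6. Show that

$$\int_0^x e^{-x^2}\,dx=xe^{-x^2}\left[1+\frac{1}{3}(2x^2)+\frac{1}{3\cdot5}(2x^2)^2+\dots\right].$$

(M. Sc. Agra ’55)

We have, integrating by parts,

$$\int_0^x e^{-x^2}\,dx=\left[xe^{-x^2}\right]_0^x+2\int_0^x x^2e^{-x^2}\,dx$$

$$=xe^{-x^2}+2\left[\frac{x^3}{3}e^{-x^2}\right]_0^x+\frac{2^2}{3}\int_0^x x^4e^{-x^2}\,dx$$

$$=xe^{-x^2}+\frac{2}{3}x^3e^{-x^2}+\frac{2^2}{3\cdot5}x^5e^{-x^2}+\frac{2^3}{3\cdot5}\int_0^x x^6e^{-x^2}\,dx$$

$$=xe^{-x^2}\left[1+\frac{1}{3}(2x^2)+\frac{1}{3\cdot5}(2x^2)^2+\dots\right].$$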
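The series of Problem 6 can be compared numerically with the integral. The sketch below assumes the identity $\int_0^x e^{-u^2}du=\tfrac{\sqrt\pi}{2}\,\mathrm{erf}(x)$ and uses `math.erf` from the standard library; the function names are hypothetical.

```python
import math

def series(x, terms=60):
    """x e^{-x^2} [1 + (2x^2)/3 + (2x^2)^2/(3*5) + ...], summed to `terms` terms."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= 2 * x * x / (2 * k + 3)   # next factor is (2x^2)/(2k+3)
    return x * math.exp(-x * x) * total

def exact(x):
    # Integral of e^{-u^2} from 0 to x, expressed through the error function
    return math.sqrt(math.pi) / 2 * math.erf(x)

for x in (0.5, 1.0, 2.0):
    print(x, series(x), exact(x))
```

The two columns agree to machine precision for moderate values of x, which supports the expansion obtained by repeated integration by parts.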

7. Assume the mean height of soldiers to be 68.22 inches with a variance of 10.8 inches square. How many soldiers in a regiment of 1000 would you expect to be over 6 feet tall, given that the area under the standard normal curve between $x=0$ and $x=0.35$ is 0.1368 and between $x=0$ and $x=1.15$ is 0.3746. (I. A. S. ’56, B. A. Hons. Delhi ’58)

Here $m=68.22$, $\sigma=\sqrt{10.8}=3.28$. Six feet is 72 inches, so

$$t=\frac{72-68.22}{3.28}=1.15.$$

The area between $x=0$ and $x=1.15$ is 0.3746. Hence the area beyond $t=1.15$ is $0.5-0.3746=0.1254$.

The total frequency is 1000 and the probability of the height of a soldier being above 6 ft. is 0.1254.

Hence the expected number of soldiers above 6 ft.

$$=0.1254\times1000=125.4,$$

i.e. about 125.

8. If skulls are classified as A, B, C according as the length-breadth index is under 75, between 75 and 80 or over 80, find approximately (assuming that the distribution is normal) the mean and S. D. of a series in which A are 58%, B are 38% and C are 4%, being given that if

$$f(t)=\frac{1}{\sqrt{2\pi}}\int_0^t\exp\left(-\tfrac12t^2\right)dt,$$

then $f(0.20)=0.08$ and $f(1.75)=0.46$. (M. Sc. Agra ’55, ’60)

Let the mean be $m$ and S. D. be $\sigma$. As for A, the area to the left of the ordinate at $x=75$ is 0.58, the area between the mean and 75 is $0.58-0.5=0.08$.

Hence

$$\frac{75-m}{\sigma}=0.20.\qquad\dots(1)$$

Similarly the area above $x=80$ is 0.04, so that between $x=m$ and $x=80$ is $0.5-0.04=0.46$.

Hence

$$\frac{80-m}{\sigma}=1.75.\qquad\dots(2)$$

Solving (1) and (2), we get

$$m=74.4,\qquad\sigma=3.2.$$
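Equations (1) and (2) of Problem 8 are two linear equations in $m$ and $\sigma$. A minimal sketch of their solution follows; the variable names are illustrative and the table values are those given in the problem.

```python
# Standardised values read from the given table: f(0.20) = 0.08, f(1.75) = 0.46
t_75, t_80 = 0.20, 1.75

# (80 - m)/sigma - (75 - m)/sigma = t_80 - t_75  gives sigma directly
sigma = (80 - 75) / (t_80 - t_75)
m = 75 - t_75 * sigma

print(round(m, 1), round(sigma, 1))
```

Subtracting (1) from (2) eliminates $m$, which is why $\sigma$ is found first.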

9. A certain examination was taken by 1000 students who are to be divided into sub-groups A, B, C, D, E according to ability, the range of ability to be equal in the sub-groups. On the assumption that ability is normally distributed, how many students should be placed in each group, given that

$$f(0.6)=0.225,\quad f(1.8)=0.463,\quad f(3.0)=0.499,$$

where

$$f(t)=\frac{1}{\sqrt{2\pi}}\int_0^t\exp\left(-\tfrac12x^2\right)dx\ ?$$

(Agra B. Sc. ’60)

Here $t=\dfrac{x-m}{\sigma}$, and the range of ability has been divided into five equal intervals ranging from $m-3\sigma$ to $m+3\sigma$, each sub-group being of width $1.2\sigma$.

The number of students between $m$ and $m+0.6\sigma$ is $0.225\times1000=225$, so the number of students between $m-0.6\sigma$ and $m+0.6\sigma$ is $2\times225=450$.

The number of students between $m+0.6\sigma$ and $m+1.8\sigma$

$$=(0.463-0.225)\times1000=238.$$

The number of students between $m+1.8\sigma$ and $m+3\sigma$

$$=(0.499-0.463)\times1000=36.$$

Since the normal distribution is symmetrical about the mean, we have the number of students in the various classes as follows :

Sub-groups                                Numbers

$m-3\sigma$ to $m-1.8\sigma$              36
$m-1.8\sigma$ to $m-0.6\sigma$            238
$m-0.6\sigma$ to $m+0.6\sigma$            450
$m+0.6\sigma$ to $m+1.8\sigma$            238
$m+1.8\sigma$ to $m+3\sigma$              36

The total number of students comes out to be 998 in place of 1000. The remaining two students can be allotted to either of the extreme classes since, at the ends, the intervals are open.

10. The following table gives the frequencies of occurrence of a variate x between certain limits :

Variate x                           Frequency

Less than 40                        30
40 or more but less than 50         33
50 or more                          37

The distribution is exactly normal. Find the distribution and also obtain the frequencies between $x=50$ and $x=60$.

(Agra B. Sc. ’60)

The total frequency is 100. Hence the proportions of the area of the normal curve between the limits are :

Variate x                           Ratio

Less than 40                        0.30
40 or more but less than 50         0.33
50 or more                          0.37

If $\phi(t)=\dfrac{1}{\sqrt{2\pi}}e^{-t^2/2}$, where $t=\dfrac{x-m}{\sigma}$, $m$ and $\sigma$ being the mean and standard deviation of the distribution, we have

$$\int_{-\infty}^{(40-m)/\sigma}\phi(t)\,dt=0.30$$

or

$$\int_{-\infty}^{(m-40)/\sigma}\phi(t)\,dt=0.70,$$

since, by symmetry,

$$\int_{-\infty}^{(40-m)/\sigma}\phi(t)\,dt=\int_{(m-40)/\sigma}^{\infty}\phi(t)\,dt=0.30,\quad\text{so that}\quad\int_0^{(m-40)/\sigma}\phi(t)\,dt=0.50-0.30=0.20,$$

and

$$\int_{-\infty}^{(m-40)/\sigma}\phi(t)\,dt=\int_{-\infty}^{0}\phi(t)\,dt+\int_0^{(m-40)/\sigma}\phi(t)\,dt=0.50+0.20=0.70.$$

From the tables of the areas of the normal curve, we get

$$\frac{m-40}{\sigma}=0.5244.\qquad\dots(1)$$

Also

$$\int_{-\infty}^{(50-m)/\sigma}\phi(t)\,dt=1-0.37=0.63,$$

giving

$$\frac{50-m}{\sigma}=0.3318.\qquad\dots(2)$$

Solving (1) and (2), we get

$$\sigma=11.68,$$

$$m=40+11.68\times0.5244=40+6.125=46.125.$$

Hence the mean is 46.125 and the standard deviation 11.68.

The frequency distribution is given by the equation

$$y=\frac{100}{\sqrt{2\pi}\times11.68}\exp\left\{-\frac{(x-46.125)^2}{2(11.68)^2}\right\}.$$

For $x=60$,

$$t=\frac{60-46.125}{11.68}=1.188.$$

The area of the standard normal curve lying to the left of $t=1.188$ is 0.8826, i.e. the frequency of the variate for less than 60 is

$$100\times0.8826=88.26.$$

Hence the frequencies between $x=50$ and $x=60$ are

$$88.26-63=25.26=25\ \text{nearly.}$$
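The fit in Problem 10 can be recomputed without tables. The sketch below assumes Python's `statistics.NormalDist` and uses its exact quantiles for the proportions 0.30 and 0.63; names such as `freq_50_60` are illustrative.

```python
from statistics import NormalDist

std = NormalDist()
z40 = std.inv_cdf(0.30)      # standardised value of x = 40 (30% lie below it)
z50 = std.inv_cdf(0.63)      # standardised value of x = 50 (63% lie below it)

# (50 - m)/sigma = z50 and (40 - m)/sigma = z40 are linear in m and sigma
sigma = (50 - 40) / (z50 - z40)
m = 40 - z40 * sigma

fitted = NormalDist(m, sigma)
freq_50_60 = 100 * (fitted.cdf(60) - fitted.cdf(50))
print(round(m, 3), round(sigma, 2), round(freq_50_60, 2))
```

Run in this way, the standardised value for $x=60$ is close to 1.19 and the frequency between 50 and 60 is a little over 25.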

11. In a distribution which is exactly normal 31 per cent of the items are under 45 and 8 per cent over 64. Find the mean and the standard deviation of the distribution. (Agra B. Sc. ’58)

Let the mean and standard deviation of the distribution be $m$ and $\sigma$ respectively. Since 31% of the items are under 45, $(50-31)$, i.e. 19%, lie between 45 and $m$. Taking our previous notations,

$$\int_{(45-m)/\sigma}^{0}\phi(t)\,dt=0.19,\quad\text{where } t=\frac{x-m}{\sigma},$$

or

$$\int_0^{(m-45)/\sigma}\phi(t)\,dt=0.19,$$

giving

$$\frac{m-45}{\sigma}=0.5\ \text{nearly (from the table of areas).}\qquad\dots(1)$$

Similarly

$$\int_0^{(64-m)/\sigma}\phi(t)\,dt=0.5-0.08=0.42,$$

which from the area tables of the normal curve gives

$$\frac{64-m}{\sigma}=1.4.\qquad\dots(2)$$

Solving (1) and (2), we get $m=50$, $\sigma=10$.

12. A minimum height is to be prescribed for eligibility to government services such that 60% of the young men will have a fair chance of coming up to that standard. The heights of young men are normally distributed with mean 60.6 inches and S. D. 2.55 inches. Determine the minimum specification. (Madras ’55)

The area of the normal probability curve above the minimum height should be 0.6.

Also $m=60.6$, $\sigma=2.55$.

From the tables the value of $t$ for which the area of the normal curve between $t$ and $\infty$ is 0.6 is $-0.2533$, so that

$$\frac{x-m}{\sigma}=-0.2533,$$

i.e.

$$\frac{x-60.6}{2.55}=-0.2533,$$

giving $x=59.95$,

so that the minimum height prescribed is 59.95 inches.
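The minimum height of Problem 12 is simply the 40th percentile of the height distribution. A minimal sketch, assuming `statistics.NormalDist` stands in for the printed tables:

```python
from statistics import NormalDist

heights = NormalDist(mu=60.6, sigma=2.55)

# 60% of the men must be at or above the minimum, so 40% lie below it
min_height = heights.inv_cdf(0.40)
print(round(min_height, 2))
```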

13. Prove that for the normal distribution, the quartile deviation, the mean deviation and the standard deviation are approximately in the ratio 10 : 12 : 15. (Tr. U. 1946)

We have seen that the Mean Deviation of the normal distribution is $\tfrac45\sigma=0.8\sigma$ (approximately).

The quartile deviation $=0.6745\sigma$.

Hence Quartile Deviation : Mean Deviation : S. D.

$$=0.6745:0.8:1$$
$$=10:12:15\ \text{approx.}$$

14. If p is the probability of winning a single game, find the probability of winning x games out of n played.

If $p=\tfrac23$ and $n=18$, show that the probability of winning more than 9 games is given approximately by

$$\frac{1}{\sqrt{2\pi}}\int_{-5/4}^{\infty}e^{-t^2/2}\,dt.$$

(Andhra ’53)

According to the binomial distribution, the probability of winning x games out of n played $={}^nC_x\,p^x(1-p)^{n-x}$.

For the second part we make use of the De Moivre-Laplace Theorem, which states :

The sum of those terms of the Binomial $(p+q)^n$ in which the number of successes $x$ ranges from $x_1$ to $x_2$, both inclusive, is approximately equal to

$$\frac{1}{\sqrt{2\pi}}\int_{t_1}^{t_2}e^{-t^2/2}\,dt,$$

where

$$t_1=\frac{x_1-\tfrac12-np}{\sigma}\quad\text{and}\quad t_2=\frac{x_2+\tfrac12-np}{\sigma},\qquad\sigma=\sqrt{npq}.$$

Now according to this theorem, the probability of winning $x$ games or less is given by

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{(x+\frac12-np)/\sigma}e^{-t^2/2}\,dt$$

and that of winning more than $x$ games is

$$\frac{1}{\sqrt{2\pi}}\int_{(x+\frac12-np)/\sigma}^{\infty}e^{-t^2/2}\,dt.$$

Here $p=\tfrac23$, $q=\tfrac13$, $n=18$, $x=9$.

Hence $np=12$, $\sigma=\sqrt{npq}=2$.

Hence the required probability

$$=\frac{1}{\sqrt{2\pi}}\int_{(9+\frac12-12)/2}^{\infty}e^{-t^2/2}\,dt=\frac{1}{\sqrt{2\pi}}\int_{-5/4}^{\infty}e^{-t^2/2}\,dt.$$
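The quality of the De Moivre-Laplace approximation in Problem 14 can be judged by comparing it with the exact binomial sum. The sketch assumes Python 3.8 or later (`math.comb`, `statistics.NormalDist`); the names `exact` and `approx` are illustrative.

```python
from math import comb
from statistics import NormalDist

n, p = 18, 2 / 3

# Exact probability of winning 10, 11, ..., 18 games
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(10, n + 1))

# Normal approximation with the half-unit correction used in the theorem
sigma = (n * p * (1 - p)) ** 0.5
lower = (9 + 0.5 - n * p) / sigma        # lower limit of the integral
approx = 1 - NormalDist().cdf(lower)

print(round(lower, 2), round(exact, 4), round(approx, 4))
```

The two probabilities differ only in the third decimal place, even though $n$ is as small as 18.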

15. In a sample of 1000 cases, the mean of a certain test is 14 and S. D. is 2.5. Assuming the normality of the distribution, find (i) how many candidates score between 12 and 15 ; (ii) how many score below 8 ; (iii) what is the probability that a candidate selected at random will score above 15 ? (Lucknow B. Sc.)

Here $m=14$, $\sigma=2.5$.

$$\therefore\quad t=\frac{x-14}{2.5}.$$

(i) The area of the probability normal curve lying between $x=12$ and 15,

$$P(12<x\le15)=\frac{1}{\sqrt{2\pi}}\int_{t_1}^{t_2}e^{-t^2/2}\,dt,\quad\text{where } t_1=\frac{12-14}{2.5}=-0.8\ \text{and}\ t_2=\frac{15-14}{2.5}=0.4,$$

$$=\frac{1}{\sqrt{2\pi}}\int_{-0.8}^{0}e^{-t^2/2}\,dt+\frac{1}{\sqrt{2\pi}}\int_{0}^{0.4}e^{-t^2/2}\,dt.$$

From the area table of the normal curve this value comes out to be $0.2881+0.1554=0.4435$.

Hence the number of candidates scoring between 12 and 15 marks is $0.4435\times1000=443.5=444$ nearly.

(ii)

$$P(x<8)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t'}e^{-t^2/2}\,dt,\quad\text{where } t'=\frac{8-14}{2.5}=-2.4,$$

$$=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{-2.4}e^{-t^2/2}\,dt=0.0082.$$

Hence the number of candidates scoring less than 8 marks is $0.0082\times1000=8.2$, i.e. about 8.

(iii)

$$P(15<x)=\frac{1}{\sqrt{2\pi}}\int_{t'}^{\infty}e^{-t^2/2}\,dt,\quad\text{where } t'=\frac{15-14}{2.5}=0.4,$$

$$=\frac{1}{\sqrt{2\pi}}\int_{0.4}^{\infty}e^{-t^2/2}\,dt=0.3446.$$

The probability that a candidate gets above 15 marks is 0.3446.
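The three areas of Problem 15 can be reproduced as follows. This is a sketch that assumes `statistics.NormalDist` in place of the area table; variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=14, sigma=2.5)

p_12_15 = scores.cdf(15) - scores.cdf(12)   # part (i)
p_below_8 = scores.cdf(8)                   # part (ii)
p_above_15 = 1 - scores.cdf(15)             # part (iii)

print(round(1000 * p_12_15, 1), round(1000 * p_below_8, 1), round(p_above_15, 4))
```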

16. The local authorities in a certain city install 2000 electric lamps in the streets of the city. If the lamps have an average life of 1000 burning hours with a standard deviation of 200 hours, (i) what number of lamps might be expected to fail in the first 700 burning hours and (ii) after what period of burning hours would we expect that 10% of the lamps would have failed ? Assume that the lives of the lamps are normally distributed. You are given that

$$F(1.50)=0.933,\quad F(1.28)=0.900,$$

where

$$F(t)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t}e^{-z^2/2}\,dz.$$

(B. A. Aligarh)

(i) We are given

$$m=1000,\quad\sigma=200.$$

Since the normal distribution is symmetrical, the area of the normal curve to the left of $x=700$ in this case is equal to the area to the right of $x=1300$.

For $x=1300$,

$$z=\frac{1300-1000}{200}=1.50.$$

The probability of a value of $x$ lying to the left of 1300 is 0.933, i.e.

$$P(x>1300)=P(x<700)=1-0.933=0.067.$$

Hence the number of lamps expected to fail in the first 700 hours

$$=2000\times0.067=134.$$

(ii) We are required to find the value of $x$ such that the area of the standard normal curve to the left of it is 0.1. Now the area to the left of $z=1.28$ is 0.900 ; hence by the symmetrical nature of the normal curve, the area to the left of $z=-1.28$ is equal to the area to the right of $z=1.28$, i.e. 0.1.

Hence $z=-1.28$,

i.e.

$$\frac{x-1000}{200}=-1.28,$$

giving $x=744$.

Hence after 744 burning hours, 10% of the lamps are expected to die out.
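Both parts of Problem 16 reduce to one evaluation of the distribution function and one of its inverse. A minimal sketch, again assuming `statistics.NormalDist` rather than the given values of $F$:

```python
from statistics import NormalDist

life = NormalDist(mu=1000, sigma=200)

failed_by_700 = 2000 * life.cdf(700)    # part (i): expected failures in 700 hours
hours_10_percent = life.inv_cdf(0.10)   # part (ii): time by which 10% have failed

print(round(failed_by_700), round(hours_10_percent))
```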

17. A company uses many thousands of electric lamps annually, burning continuously day and night. Assume that under such conditions the life of a lamp may be regarded as a variable normally distributed about a mean of 50 days with a S. D. of 19 days.

On January 1, 1951, the company put 5000 new lamps into service. How many would you expect to need replacement by (a) February 1, (b) April 1 ? The lamps may be supposed all put into operation at about the same time of the day.

Here the mean is 50, standard deviation 19.

(a) From January 1, 1951 to Feb. 1, 1951, 31 days have elapsed. The deviation from the mean is $-19$ days, i.e. $-\sigma$.

The proportion of the number of bulbs having life less than 31 days is given by the area of the normal probability curve lying to the left of $-1$, i.e. between $-\infty$ and $-1$, which from the tables is 0.1587.

Hence the number of bulbs expected to fail during this period

$$=5000\times0.1587=793.5=794\ \text{nearly.}$$

(b) Up to 1st April, the number of days is 90 ; the deviation from the mean is 40, i.e. $2.11\sigma$.

The area of the standard normal curve between $-\infty$ and 2.11 from the tables $=0.9826$, which is equal to the proportion of the number of bulbs having life less than 90 days.

Hence the bulbs expected to fail up to 1st April

$$=0.9826\times5000=4913.$$
18. An establishment uses 1000 bulbs which are kept burning approximately for four hours every day. Past experience of costs incurred in replacing burnt-out bulbs has shown that it is profitable to replace all the 1000 bulbs, whether some of them are burnt out or not, once in four months (120 days). Bulbs burning out during this period are not replaced till the end of this period and the establishment is prepared to suffer the inconvenience so caused. Assuming that the life of a bulb is nearly normally distributed with a mean of 450 hours and standard deviation of 30 hours, find the expected number of hours for which a bulb is dead at any point during the four-month period.

[It is known that the area of the normal curve beyond a distance of one standard deviation from the mean is equal to 0.159. You need not simplify any expression which needs the use of tables not supplied to you.] (I. A. S. ’60)

The probability that the bulb lives up to $x$, that is, it dies between $x$ and $x+dx$ hours, is given by

$$df=\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-m)^2/2\sigma^2}\,dx.$$

The bulbs are replaced after 480 hours of installation and hence a bulb which is burnt out at $x$ hours remains dead for $480-x$ hours.

Hence the expected number of hours for which a bulb remains dead,

$$E(480-x)=\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{480}e^{-(x-m)^2/2\sigma^2}(480-x)\,dx.$$

Now $m=450$, $\sigma=30$.

Putting $\dfrac{x-450}{30}=t$, so that $480-x=30(1-t)$, we get

$$E(480-x)=\frac{30}{\sqrt{2\pi}}\int_{-\infty}^{1}(1-t)e^{-t^2/2}\,dt$$

$$=\frac{30}{\sqrt{2\pi}}\int_{-\infty}^{1}e^{-t^2/2}\,dt-\frac{30}{\sqrt{2\pi}}\int_{-\infty}^{1}te^{-t^2/2}\,dt$$

$$=30\,(1-0.159)+\frac{30}{\sqrt{2\pi}}\left[e^{-t^2/2}\right]_{-\infty}^{1}$$

$$=25.23+\frac{30}{\sqrt{2\pi}}e^{-1/2},$$

which is the required number of hours a bulb is expected to remain dead.
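The expression left unsimplified in Problem 18 can be evaluated and cross-checked by direct numerical integration. The sketch assumes `statistics.NormalDist`; the mid-point rule, the cut-off of ten standard deviations below the mean, and the names `closed_form` and `numeric` are all choices made for this illustration.

```python
from statistics import NormalDist

m, sigma, replace_at = 450.0, 30.0, 480.0
std = NormalDist()

# Closed form from the solution: 30 * (area to the left of t = 1) + 30 * ordinate at t = 1
a = (replace_at - m) / sigma
closed_form = sigma * (a * std.cdf(a) + std.pdf(a))

# Mid-point rule for the integral of (480 - x) f(x) dx from m - 10*sigma to 480
life = NormalDist(m, sigma)
steps = 20_000
lo = m - 10 * sigma
h = (replace_at - lo) / steps
numeric = h * sum(
    (replace_at - (lo + (i + 0.5) * h)) * life.pdf(lo + (i + 0.5) * h)
    for i in range(steps)
)

print(round(closed_form, 2), round(numeric, 2))
```

Both values come to about 32.5 hours, since $30e^{-1/2}/\sqrt{2\pi}$ is roughly 7.26.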


19. The marks obtained in statistics in a certain examination are found to be normally distributed. If 12.5% of the candidates obtain 60% or more marks, 39% obtain less than 30 marks, find the mean number of marks obtained by the candidates, given

$x/\sigma$    0.27     0.28     0.29     1.14     1.15     1.16

$A$           0.6064   0.6102   0.6104   0.8727   0.8749   0.8770

where

$$A=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t}e^{-t^2/2}\,dt,\qquad t=\frac{x}{\sigma},$$

$x$ being the deviation from the mean.

(B. Sc. Lucknow ’47)

Since the area of the standard normal curve to the right of $t=1.15$ is 0.1251 and the proportion of the candidates obtaining more than 60 marks is 0.125 (given), we have

$$\frac{60-m}{\sigma}=1.15.\qquad\dots(1)$$

Now the proportion of candidates getting less than 30 marks is 0.39 ; the area between $-\infty$ and $30+2(m-30)$ of the normal curve is $0.39+2(0.5-0.39)=0.61$.

On interpolation, the value of $\dfrac{x}{\sigma}$ for $A$ as 0.61 comes out to be 0.279.

Hence

$$\frac{x}{\sigma}=\frac{m-30}{\sigma}=0.279.\qquad\dots(2)$$

Solving (1) and (2), we get

$$\frac{60-m}{m-30}=\frac{1.15}{0.279},$$

giving $m=35.86$ (approximately).

20. The following table gives the test scores and their frequencies ; convert them into standard scores with mean 50 and standard deviation 10.

Test scores    9    8    7    6    5    4    3    Total

Frequency      2    6    11   11   9    8    3    50

(Agra B. Sc. ’61)

In order to find the mean and standard deviation, we form the table.

x_i      f_i     z_i = x_i - 6     f_i z_i     f_i z_i^2

9        2        3                 6           18
8        6        2                 12          24
7        11       1                 11          11
6        11       0                 0           0
5        9       -1                -9           9
4        8       -2                -16          32
3        3       -3                -9           27

Total    50                        -5           121

We get the mean

$$m=6+\frac{1}{50}\sum f_iz_i=6+\frac{(-5)}{50}=5.9,$$



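and the variance

$$\sigma^2=\frac{121}{50}-\left(\frac{-5}{50}\right)^2=2.41.$$

$\therefore\ \sigma=1.55$.

If the given scores, denoted by $x$ with mean $m$ and S. D. $\sigma$, are to be converted to standard scores denoted by $x'$ with mean $m'$ and standard deviation $\sigma'$, we have

$$\frac{x-m}{\sigma}=\frac{x'-m'}{\sigma'}$$

or

$$x'=\frac{(x-m)\,\sigma'}{\sigma}+m'=\frac{(x-5.9)\times10}{1.55}+50=6.45x+11.95.$$

Substituting the values of $x$, we get the converted scores as 70.00, 63.55, 57.10, 50.65, 44.20, 37.75, 31.30. (As a check, the mean score $x=5.9$ is carried to $x'=50$.)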
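The conversion of Problem 20 can be sketched in a few lines. The code below works with the unrounded standard deviation, so its converted scores differ slightly in the second decimal from those obtained with $\sigma=1.55$; all names are illustrative.

```python
scores = [9, 8, 7, 6, 5, 4, 3]
freqs = [2, 6, 11, 11, 9, 8, 3]
n = sum(freqs)

mean = sum(f * x for x, f in zip(scores, freqs)) / n
var = sum(f * (x - mean) ** 2 for x, f in zip(scores, freqs)) / n
sd = var ** 0.5

# Standard scores with mean 50 and standard deviation 10
standard = [50 + 10 * (x - mean) / sd for x in scores]

# The converted scores should themselves have mean 50 and S.D. 10
new_mean = sum(f * s for s, f in zip(standard, freqs)) / n
new_sd = (sum(f * (s - new_mean) ** 2 for s, f in zip(standard, freqs)) / n) ** 0.5
print(round(mean, 2), round(var, 2), [round(s, 2) for s in standard])
```

Checking that the converted scores have mean 50 and S. D. 10 is a quick way of catching a slip in the constant term of the conversion formula.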

8.23. Fitting a Normal Distribution to the Given Data. The mean and standard deviation of the data are found. The values of $t$ for the given values of $x$ are given by

$$t=\frac{x-m}{\sigma},$$

where $x$ is the middle value of the class interval (in case of grouped data), $m$ is the mean and $\sigma$ the standard deviation. The values of the ordinates for the calculated values of $t$ are read from the table of ordinates of the standard normal curve and then multiplied by $\dfrac{N}{\sigma/i}$, where $N$ is the total frequency and $i$ the length of the class interval.

Example. Fit a normal distribution to the data given below :

Interval mid-points   100  95  90  85  80  75  70  65  60  55  50  45
Frequency             0    1   3   2   7   12  10  9   5   3   2   0

Here $N=54$, $m=71.2$, $\sigma=9.95$, $i=5$.

The respective values of $t$ are

2.89, 2.39, 1.89, 1.39, 0.88, 0.38, $-0.12$, $-0.62$, $-1.13$, $-1.63$, $-2.13$, $-2.63$.

The values of the ordinates of the normal curve

$$\phi(t)=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}$$

for the above values of $t$ from the table are respectively :

0.0061, 0.0229, 0.0669, 0.1518, 0.2709, 0.3712, 0.3961, 0.3292, 0.2107, 0.1057, 0.0413, 0.0126.

Note that $\phi(-t)=\phi(t)$.

Since $i=5$,

$$\frac{\sigma}{i}=\frac{9.95}{5}=1.99.$$

Hence multiplying the values of the ordinates by $\dfrac{54}{1.99}$, i.e. 27.14, we get the theoretical frequencies as

x      100   95    90    85    80    75     70     65    60    55    50    45

f_t    0.2   0.6   1.8   4.1   7.3   10.1   10.7   8.9   5.7   2.9   1.1   0.3
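The fitting procedure of § 8.23 can be sketched as below. The mean is recomputed from the mid-points and frequencies of the example, the standard deviation 9.95 is taken from the text, and the ordinates come from `statistics.NormalDist().pdf` instead of the printed table; names are illustrative.

```python
from statistics import NormalDist

mids = [100, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45]
freqs = [0, 1, 3, 2, 7, 12, 10, 9, 5, 3, 2, 0]

N = sum(freqs)                                     # total frequency
m = sum(f * x for x, f in zip(mids, freqs)) / N    # mean of the grouped data
sigma, width = 9.95, 5                             # S.D. as quoted in the example; class width

std = NormalDist()
# Theoretical frequency = (N * i / sigma) * ordinate at t = (x - m) / sigma
expected = [N * width / sigma * std.pdf((x - m) / sigma) for x in mids]

for x, f, e in zip(mids, freqs, expected):
    print(x, f, round(e, 1))
```

Small differences from the tabulated frequencies arise only from rounding $t$ to two decimals before consulting the table.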


8.24. Deduce the law of probability distribution of errors in the form

$$y=\frac{h}{\sqrt{\pi}}e^{-h^2x^2},$$

stating carefully the assumptions involved. (Agra B. Sc. ’59)

Denote the probability of occurrence of $X$ by $f(X)$, where the variable $X$ takes the values $X_1, X_2, X_3, \dots, X_n$. If we shift the origin to a point $a$ (to be specified later) on the x-axis, we may write $x=X-a$, where $x$ is the deviation or error of $X$ from $a$.

Now the compound probability of the occurrence of the errors $x_1, x_2, \dots, x_n$ is given by $P$, where

$$P=f(x_1)\,f(x_2)\dots f(x_n).\qquad\dots(1)$$

We shall assume that

(a) P is continuous and differentiable over the whole range,

(b) There is only one maximum value of P in the range,

(c) The value of $a$ makes P a maximum, which means that $a$ is the average of $X_1, X_2, \dots, X_n$.

If we differentiate (1) logarithmically, we get

$$\frac{f'(x_1)}{f(x_1)}\frac{dx_1}{da}+\frac{f'(x_2)}{f(x_2)}\frac{dx_2}{da}+\dots+\frac{f'(x_n)}{f(x_n)}\frac{dx_n}{da}=0,\qquad\dots(2)$$

since P is to be maximum.

But $X_1-a=x_1$, $X_2-a=x_2$, etc.

Hence

$$\frac{dx_1}{da}=\frac{dx_2}{da}=\dots=-1.$$

Then (2) gives

$$\frac{f'(x_1)}{f(x_1)}+\frac{f'(x_2)}{f(x_2)}+\dots+\frac{f'(x_n)}{f(x_n)}=0.\qquad\dots(3)$$

If we take the true mean at $a$, we get

$$\sum_{r=1}^{n}x_r=\sum_{r=1}^{n}(X_r-a)=0,$$

i.e.

$$x_1+x_2+x_3+\dots+x_n=0.\qquad\dots(4)$$

Multiplying (4) by $k$ and adding to (3), we get

$$\left\{\frac{f'(x_1)}{f(x_1)}+kx_1\right\}+\left\{\frac{f'(x_2)}{f(x_2)}+kx_2\right\}+\dots+\left\{\frac{f'(x_n)}{f(x_n)}+kx_n\right\}=0.$$

This relation will be satisfied if

$$\frac{f'(x)}{f(x)}=-kx\qquad\dots(5)$$

for $x=x_1, x_2, \dots, x_n$.

Integrating (5), we get

$$f(x)=Ae^{-kx^2/2},$$

where $A$ and $k$ are constants to be determined.

For P (or log P) to be maximum, we get

$$\frac{d^2}{da^2}(\log P)<0$$

or

$$\sum_{r=1}^{n}\frac{f(x_r)\,f''(x_r)-\{f'(x_r)\}^2}{\{f(x_r)\}^2}<0.\qquad\dots(6)$$

But $f(x)=Ae^{-kx^2/2}$ and $f'(x)=-Akxe^{-kx^2/2}=-kx\,f(x)$,

and $f''(x)=-k\,[x\,f'(x)+f(x)]$.

Hence from (6),

$$\sum\frac{f(x_r)\,[-kx_r\,f'(x_r)-k\,f(x_r)]-k^2x_r^2\{f(x_r)\}^2}{\{f(x_r)\}^2}<0$$

or

$$\sum\frac{f(x_r)\,kx_r\,\{-kx_r\,f(x_r)\}+k\{f(x_r)\}^2+k^2x_r^2\{f(x_r)\}^2}{\{f(x_r)\}^2}>0$$

or

$$\sum k>0.$$

Hence $k$ is positive, say $2h^2$.

We now find $A$ from the condition that

$$\int_{-\infty}^{\infty}Ae^{-h^2x^2}\,dx=1$$

or

$$A\,\frac{\sqrt{\pi}}{h}=1$$

or

$$A=\frac{h}{\sqrt{\pi}}.$$

$$\therefore\quad f(x)=\frac{h}{\sqrt{\pi}}e^{-h^2x^2},$$

which is the required law of probability distribution of errors, usually called the Gaussian Error Law.





Exercises 

1. In an experiment with 800 seeds in groups of 10, the following results were obtained :

x    0    1    2    3    4    5    6    Total

f    6    20   28   12   8    6    0    80

where f denotes the number of groups in which x seeds germinated. Fit a binomial distribution to these data. Calculate the theoretical frequencies. (B. Sc. Agra ’55)

Ans. $80\,(0.7825+0.2175)^{10}$ ; the theoretical frequencies are as

x      0     1      2      3      4     5     6     7

f_t    6.9   19.1   24.0   17.8   8.6   2.9   0.7   0.1

2. Calculate the ordinates of the binomial $1024\,(0.5+0.5)^{10}$ and compare them with those of the normal curve.

Binomial    1     10     45     120     210     252     210     etc.

Normal      1.7   10.5   42.7   116.1   211.5   258.4   211.5   etc.

3. Five thousand candidates appeared in a certain examination carrying a maximum of 100 marks. It was found that the marks were normally distributed with a mean 39.5 and with a standard deviation 12.5. Determine approximately the number of students who secured a first class for which a minimum of 60 marks is necessary. You may use the table given below.

The proportion A of the whole area of the normal curve lying to the left of the ordinate at the deviation $\dfrac{x}{\sigma}$ is

$x/\sigma$    1.5       1.6       1.7       1.8

$A$           0.93319   0.94520   0.95543   0.96407

(M. Sc. Agra ’61) [Ans. 250]

4. Assume that the distribution of grades in a class of 500 freshmen is normal with mean 72 and S. D. 10. The instructor wants to give letter grades as follows : 10% A’s, 30% B’s, 40% C’s, 15% D’s and 5% F’s. Compute to the closest score the divisions between A’s and B’s ; B’s and C’s ; C’s and D’s ; D’s and F’s. [Ans. 85, 75, 64, 56]

5. In 1000 trials of an event of rare probability the frequency f of the number of successes x turned out to be

x    0     1     2     3     4    5    6    7    8

f    229   325   257   119   50   17   2    1    0

Calculate all the theoretical frequencies given that

$e^{-1.5}=0.2231$. (M. Sc. Agra ’51)

[Ans. 223.1, 334.7, 251.0, 125.5, 47.1, 14.1, 3.5, 0.8, 0.2]

6 . In a packet of flower seeds f are known to be pink flowering 
and the remainder are yellow. Calculate the probabilities of 
getting 0, 1, 2, ..6 pink flowers in a row of six plants. If 
250 rows of 6 plants are planted, approximately how many 
will contain (a) all pink flowers, (b) all yellow flowers ? 

[Ans. j,, (64, 576, 2160, 4320, 4860, 2916, 729); (a) 1; (b) 12 ] 

7. A firm making an electrical switch produces 1% defective. What is the chance of getting at least 6 defectives in a box of 200 switches on the assumption that the distribution of defectives is Poissonian ? [Ans. 0

8 . A bombing technique secures 1 out of 10 hits in the target 

area. Use the Poisson distribution to determine how 
many barrels should be launched in order to have a 90% 
chance of securing at least 8 hits. [Ans. 120] 

9. In marking one thousand English papers, five classes were distinguished, α, β, γ, δ, ε. Assuming that the distribution is normal and that the difference in score between each class and the next is constant, estimate the frequencies in each of these classes. (Assume that the distribution extends from $-3.5\sigma$ to $+3.5\sigma$ on each side of the mean.)

[Ans. (α) 18 ; (β) 224 ; (γ) 516 ; (δ) 224 ; (ε) 18.]



As a rule h per cent of certain manufactured products are 
defective What is the probability that 1000 of them will 


have 10 or more defective ? 



P=e~ b 






CHAPTER IX


MOMENT GENERATING FUNCTION AND CUMULANTS


9.1. Introduction. The nth moment about the origin of a distribution with probability density function

$$y=f(x)\qquad\dots(1)$$

is given by

$$\mu_n'=\sum_i x_i^n\,f_i(x)\quad\text{or}\quad\int_{-\infty}^{\infty}x^n f(x)\,dx\qquad\dots(2)$$

according as the distribution is discrete or continuous. The mathematical expectation of $g(x)$ is

$$E\{g(x)\}=\sum_i g_i(x)\,f_i(x)\quad\text{or}\quad\int_{-\infty}^{\infty}g(x)\,f(x)\,dx.\qquad\dots(3)$$

Now if we put $g(x)=e^{\theta x}$, we get

$$E(e^{\theta x})=\sum_i e^{\theta x_i}f_i(x)\quad\text{or}\quad\int_{-\infty}^{\infty}e^{\theta x}f(x)\,dx\qquad\dots(4)$$

for discrete and continuous distributions respectively,

or

$$E(e^{\theta x})=\sum_i f_i(x)\left\{1+\theta x_i+\frac{\theta^2x_i^2}{2!}+\dots+\frac{\theta^nx_i^n}{n!}+\dots\right\}$$

$$=1+\theta\mu_1'+\frac{\theta^2}{2!}\mu_2'+\dots+\frac{\theta^n}{n!}\mu_n'+\dots\qquad\dots(5)$$

The function $E(e^{\theta x})$ is known as the moment generating function of $f(x)$, with the condition that the sum or integral given by (4) converges over a range of $\theta$ and the summation or integration is permissible. The nth moment about the origin is the coefficient of $\dfrac{\theta^n}{n!}$ in $E(e^{\theta x})$, which is also written as $M_x(\theta)$. It may be noted that $\theta$ is a parameter taking real values and has no other particular significance.

9.2. Change of Origin and Scale in Moment Generating Functions. If the origin for the variable is taken at some value $a$, we have

$$M_{x-a}(\theta)=\sum_i e^{\theta(x_i-a)}f_i(x)=e^{-a\theta}\sum_i e^{\theta x_i}f_i(x)=e^{-a\theta}M_x(\theta).$$

Thus the change of origin from $x=0$ to $x=a$ results in multiplying the m.g.f. about $x=0$ by $e^{-a\theta}$.

If the scale of measurement of $x$ is changed so that $x$ is changed to $z$, we have

$$M_z(\theta)=\sum e^{\theta z}f(x).$$

If $z=kx$, we have

$$M_z(\theta)=\sum e^{k\theta x}f(x)=M_x(k\theta).$$

Thus the effect of a change of scale from x to kx is to replace $\theta$ by $k\theta$ in the m.g.f.

It may be noted that

$$\left[\frac{d^n}{d\theta^n}M_x(\theta)\right]_{\theta=0}=\mu_n'.$$

The name ‘moment generating function’ is due to the fact that the m.g.f. ‘generates’ the moments simply by differentiation or expansion of the m.g.f.

9.3. An important property of the moment generating functions is given by the theorem :

The moment generating function of a sum of a number of independent variables is equal to the product of their individual moment generating functions.

The expected value of the product of a number of independent variates is the product of the expected values of the variates. If $x_1, x_2, \dots, x_r, \dots$ are a number of independent variates,

$$E\{e^{\theta(x_1+x_2+x_3+\dots)}\}=E(e^{\theta x_1}\cdot e^{\theta x_2}\cdot e^{\theta x_3}\dots)$$

$$=E(e^{\theta x_1})\cdot E(e^{\theta x_2})\cdot E(e^{\theta x_3})\dots$$

$$=M_{x_1}(\theta)\cdot M_{x_2}(\theta)\cdot M_{x_3}(\theta)\dots$$

and hence the theorem.

The moments of various distributions studied in previous chapters can be found with the help of the m.g.f., but this method is not always very convenient. However, moment generating functions play an important role in statistical theory. Some standard distributions are considered in the following articles.

9.4. Binomial Distribution. The relative frequency of x successes is

$${}^nC_x\,p^xq^{n-x},$$

so that

$$M_x(\theta)=\sum_x {}^nC_x\,p^xq^{n-x}e^{\theta x}=(q+pe^{\theta})^n.$$

The moments can be obtained by the formula

$$\mu_k'=\frac{d^kM_x(\theta)}{d\theta^k},\ \text{when } \theta=0.$$

Now

$$M_x'(\theta)=npe^{\theta}(q+pe^{\theta})^{n-1},$$

$$M_x''(\theta)=np\,[e^{\theta}(q+pe^{\theta})^{n-1}+(n-1)\,pe^{2\theta}(q+pe^{\theta})^{n-2}].$$

Putting $\theta=0$ in the above results and remembering $q+p=1$,

$$\mu_1'=np,$$

$$\mu_2'=np\,[1+(n-1)\,p]=n^2p^2+npq,$$

$$\mu_2=\mu_2'-\mu_1'^2=npq.$$
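The way the m.g.f. yields the moments by differentiation at $\theta=0$ can be illustrated numerically. The sketch below replaces the derivatives of $(q+pe^{\theta})^n$ by central differences; the values $n=10$, $p=0.3$ and the step $h$ are assumptions chosen only for the illustration.

```python
import math

n, p = 10, 0.3
q = 1 - p

def mgf(theta):
    """Moment generating function (q + p e^theta)^n of the binomial distribution."""
    return (q + p * math.exp(theta)) ** n

h = 1e-4
mu1 = (mgf(h) - mgf(-h)) / (2 * h)                    # approximates M'(0)
mu2_raw = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h ** 2  # approximates M''(0)
variance = mu2_raw - mu1 ** 2

print(round(mu1, 4), round(mu2_raw, 4), round(variance, 4))
```

The three numbers agree with $np$, $n^2p^2+npq$ and $npq$ to within the error of the finite differences.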

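9.5. Poisson Distribution. (Agra B. Sc. ’61)

The probability of happening of x rare events is given by

$$\frac{m^xe^{-m}}{x!}.$$

$$M_x(\theta)=\sum_{x=0}^{\infty}e^{\theta x}\,\frac{m^xe^{-m}}{x!}=e^{m(e^{\theta}-1)}.$$

$$M_x'(\theta)=me^{\theta}\,e^{m(e^{\theta}-1)},$$

$$M_x''(\theta)=m\,\{e^{\theta}e^{m(e^{\theta}-1)}\,me^{\theta}+e^{\theta}e^{m(e^{\theta}-1)}\}.$$

Putting $\theta=0$, we have

$$\mu_1'=m,\qquad\mu_2'=m(m+1).$$

Hence $\mu_2=m^2+m-m^2=m$. (Agra B. Sc. ’61)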

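9.6. Negative Binomial.

The negative binomial is the expansion of $(Q-P)^{-n}$, where $Q-P=1$, i.e. of

$$Q^{-n}\left(1-\frac{P}{Q}\right)^{-n},$$

so that the probability of $x$ is

$$p(x)=\frac{n(n+1)(n+2)\dots(n+x-1)}{x!}\,Q^{-n}\left(\frac{P}{Q}\right)^x.$$

$$M_x(\theta)=\sum_{x=0}^{\infty}e^{\theta x}\,p(x)$$

$$=Q^{-n}\sum_{x=0}^{\infty}\frac{n(n+1)(n+2)\dots(n+x-1)}{x!}\left(\frac{Pe^{\theta}}{Q}\right)^x$$

$$=(Q-Pe^{\theta})^{-n}.$$

$$\mu_1'=\left[\frac{d}{d\theta}M_x(\theta)\right]_{\theta=0}=n\,[(Q-Pe^{\theta})^{-n-1}Pe^{\theta}]_{\theta=0}=nP.$$

$$\mu_2'=\left[\frac{d^2}{d\theta^2}M_x(\theta)\right]_{\theta=0}=nP\,[(n+1)(Q-Pe^{\theta})^{-n-2}Pe^{2\theta}+(Q-Pe^{\theta})^{-n-1}e^{\theta}]_{\theta=0}$$

$$=nP\,\{(n+1)P+1\},$$

$$\mu_2=\mu_2'-(\mu_1')^2$$

$$=nP\,\{(n+1)P+1\}-(nP)^2$$

$$=nP(P+1)$$

$$=nPQ,\ \text{since } Q-P=1.$$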


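9.7. Normal Distribution.

The probability function with the mean as the origin is given by

$$y=\frac{1}{\sigma\sqrt{2\pi}}e^{-x^2/2\sigma^2}.$$

$$M_x(\theta)=\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{\theta x}e^{-x^2/2\sigma^2}\,dx$$

$$=\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\left(\theta x-\frac{x^2}{2\sigma^2}\right)dx$$

$$=\frac{e^{\frac12\theta^2\sigma^2}}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\left\{-\frac{(x-\theta\sigma^2)^2}{2\sigma^2}\right\}dx.$$

Putting $\dfrac{x-\theta\sigma^2}{\sqrt2\,\sigma}=t$, we get

$$dx=\sqrt2\,\sigma\,dt.$$

Hence

$$M_x(\theta)=\frac{1}{\sqrt{\pi}}e^{\frac12\theta^2\sigma^2}\int_{-\infty}^{\infty}\exp(-t^2)\,dt$$

$$=e^{\frac12\theta^2\sigma^2}=1+\tfrac12\theta^2\sigma^2+\frac{(\frac12\theta^2\sigma^2)^2}{2!}+\dots+\frac{(\frac12\theta^2\sigma^2)^n}{n!}+\dots$$

Since the m.g.f. involves only even powers of $\theta$, all the odd moments about the mean vanish and

$$\mu_{2n}=\frac{(2n)!}{2^n\,n!}\,\sigma^{2n}=1\cdot3\cdot5\dots(2n-1)\,\sigma^{2n}.$$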


9.8. If the independent variates $x_i$ $(i=1, 2, 3, \dots, n)$ are normally distributed with means $m_i$ and variances $\sigma_i^2$, the variate $\sum c_ix_i$ is normally distributed with mean $\sum c_im_i$ and variance $\sum c_i^2\sigma_i^2$.

The m.g.f. of the normal distribution about the origin is $\exp(m\theta+\tfrac12\theta^2\sigma^2)$. If the variate is $cx$ in place of $x$, then by the property of m.g.f.’s proved earlier in § 9.2, the moment generating function is $\exp(cm\theta+\tfrac12c^2\theta^2\sigma^2)$.

If there are n independent variates $c_ix_i$, $i=1, 2, 3, \dots, n$, the m.g.f. of their sum, from the last theorem, is

$$\exp\left\{\theta\sum_{i=1}^{n}c_im_i+\tfrac12\theta^2\sum_{i=1}^{n}c_i^2\sigma_i^2\right\}.$$

But this is the m.g.f. of a normal distribution with mean $\sum_{i=1}^{n}c_im_i$ and variance $\sum_{i=1}^{n}c_i^2\sigma_i^2$.

Hence the theorem.

In particular, if X and Y are two independent normal variates with means $m_1$ and $m_2$ and variances $\sigma_1^2$ and $\sigma_2^2$ respectively, $X+Y$ is normally distributed with mean $m_1+m_2$ and variance $\sigma_1^2+\sigma_2^2$. Similarly $X-Y$ is normally distributed with mean $m_1-m_2$ and variance $\sigma_1^2+\sigma_2^2$. (Agra M. Sc. ’62)

Another important additive property of normal variates can be deduced by putting $c_i=\dfrac1n$, $m_i=m$ and $\sigma_i=\sigma$ ; we get the important result :

If the independent variates $x_i$ $(i=1, 2, 3, \dots, n)$ are normally distributed with a common mean m and common variance $\sigma^2$, their mean is also a normal variate with mean m and variance $\dfrac{\sigma^2}{n}$.


9.9. Cumulants. If the logarithm of the m.g.f. $M_x(\theta)$ can be expanded as a convergent series in powers of $\theta$, the expansion being denoted by $k(\theta)$, we have

$$k(\theta)=\log_e M_x(\theta)$$

$$=k_1\theta+k_2\frac{\theta^2}{2!}+k_3\frac{\theta^3}{3!}+\dots$$

$$=\log\left[1+\mu_1'\theta+\mu_2'\frac{\theta^2}{2!}+\dots\right].$$

Comparing the coefficients of the similar powers of $\theta$ in the two series, we get

$$k_1=\mu_1',\qquad k_2=\mu_2'-\mu_1'^2=\mu_2,$$

$$k_3=\mu_3'-3\mu_2'\mu_1'+2\mu_1'^3=\mu_3,$$

$$k_4=\mu_4'-4\mu_3'\mu_1'-3(\mu_2')^2+12\mu_2'\mu_1'^2-6\mu_1'^4$$

$$=\mu_4-3\mu_2^2.$$
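The relations between cumulants and moments about the origin can be tried on a distribution whose cumulants are known. The sketch below uses a Poisson distribution with mean 2, for which the cumulative function is $m(e^{\theta}-1)$ and every cumulant therefore equals $m$; the function name and the truncation of the sums at 60 terms are choices made for this illustration.

```python
import math

def cumulants_from_raw(m1, m2, m3, m4):
    """First four cumulants from the first four moments about the origin."""
    k1 = m1
    k2 = m2 - m1 ** 2
    k3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    k4 = m4 - 4 * m3 * m1 - 3 * m2 ** 2 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4
    return k1, k2, k3, k4

# Moments about the origin of a Poisson distribution with mean 2, by direct summation
m = 2.0
pmf = [math.exp(-m) * m ** x / math.factorial(x) for x in range(60)]
raw = [sum(x ** r * px for x, px in enumerate(pmf)) for r in (1, 2, 3, 4)]

k1, k2, k3, k4 = cumulants_from_raw(*raw)
print(round(k1, 6), round(k2, 6), round(k3, 6), round(k4, 6))
```

All four cumulants come out equal to the mean, as expected for a Poisson variate.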



$k_r$ is called the rth cumulant of the distribution and $k(\theta)$ the cumulative function.

Since

$$M_{x-a}(\theta)=M_x(\theta)\,e^{-a\theta},$$

the cumulative function relative to $a$ is

$$k_a(\theta)=\log_e M_x(\theta)-a\theta.$$

Thus all the cumulants except the first remain unchanged by a shift of the origin of the values of the variate. This property gives the cumulative function an advantage over the m.g.f., since a change of origin from 0 to $a$ multiplies the m.g.f. by $e^{-a\theta}$, while the only change in the cumulative generating function is the addition of the term $-a\theta$ ; that is, the first cumulant becomes $k_1-a$ and the others remain unchanged. The mean and other moments can be found by calculating the cumulative function about any convenient origin.

9.10. Additive Property of Cumulants. The rth cumulant of the sum of a number of independent variates is the sum of the rth cumulants of the individual variates.

The property follows from the theorem that the m.g.f. of the sum of a number of independent variates is the product of the m.g.f.’s of the variates. Since the cumulative function is the logarithm of the m.g.f., the rth cumulant of the sum shall be the sum of the rth cumulants of the individual variates.


Example 1. For the rectangular distribution $y=\dfrac{1}{2a}$, where $-a<x<a$, show that the moment generating function about origin zero is given by $\dfrac{1}{at}\sinh at$. Also show that

$$\mu_{2n}=\frac{a^{2n}}{2n+1}.$$

(Agra M. Sc. ’58, B. Sc. ’63)

$$M_x(t)=\int_{-a}^{a}e^{tx}\,\frac{1}{2a}\,dx$$

$$=\frac{1}{2at}\,(e^{at}-e^{-at})$$

$$=\frac{1}{at}\sinh at$$

$$=1+\frac{a^2t^2}{3!}+\frac{a^4t^4}{5!}+\dots$$

The first moment $\mu_1'$ is zero, since the coefficient of $t$ is zero.

Hence $\mu_n=\mu_n'$, and

$$\mu_{2n}=\text{coefficient of }\frac{t^{2n}}{(2n)!}=\frac{a^{2n}}{2n+1}.$$

Example 2. Find the m.g.f. for the triangular distribution defined by

$$f(x)=x,\quad 0<x<1,$$

$$f(x)=2-x,\quad 1<x<2.$$

We have

$$M_x(\theta)=\int_0^1xe^{\theta x}\,dx+\int_1^2(2-x)\,e^{\theta x}\,dx$$

$$=\frac{1}{\theta^2}\,(e^{\theta}-1)^2.$$

9.11. The sum of a finite number of independent Poissonian variates $x_1, x_2, \dots, x_n$ with means $m_1, m_2, \dots, m_n$ respectively is a Poissonian variate with mean $\sum_{i=1}^{n}m_i$. (Agra M. Sc. ’62, B. Sc. ’63)

The m.g.f. of the Poisson variate $x_i$ with mean $m_i$ is

$$M_{x_i}(\theta)=e^{m_i(e^{\theta}-1)}.$$

Hence the cumulative function is

$$k_i(\theta)=\log_e M_{x_i}(\theta)$$

$$=m_i\,(e^{\theta}-1).$$

The cumulative function of the sum of the variates $x_1, x_2, \dots, x_n$ is given by

$$k(\theta)=\sum_{i=1}^{n}m_i\,(e^{\theta}-1)$$

$$=(e^{\theta}-1)\sum_{i=1}^{n}m_i.$$

Thus we see that the distribution of $\sum_{i=1}^{n}x_i$ is Poissonian with mean as the sum of the means of the variates.
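The additive property of Poisson variates can also be seen directly, without the m.g.f., by convolving two Poisson distributions. In the sketch below the means 1.5 and 2.5 and the function names are illustrative assumptions.

```python
import math

def poisson_pmf(m, x):
    """Probability of x events for a Poisson variate with mean m."""
    return math.exp(-m) * m ** x / math.factorial(x)

m1, m2 = 1.5, 2.5

def pmf_of_sum(s):
    # P(x1 + x2 = s) for independent variates: sum over every split j + (s - j)
    return sum(poisson_pmf(m1, j) * poisson_pmf(m2, s - j) for j in range(s + 1))

# Largest difference from the Poisson distribution with mean m1 + m2
max_gap = max(abs(pmf_of_sum(s) - poisson_pmf(m1 + m2, s)) for s in range(30))
print(max_gap)
```

The difference is of the order of rounding error only, in agreement with the theorem.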

9.12. Characteristic Function. If in the moment generating function the parameter is taken to be purely imaginary, so that in place of multiplying the probability density function by $e^{\theta x}$ it is multiplied by $e^{i\theta x}$, where $i=\sqrt{-1}$ and $\theta$ is real, we have the function known as the characteristic function.

$$C(\theta)=E(e^{i\theta x})=\int_{-\infty}^{\infty}e^{i\theta x}f(x)\,dx.$$

In m.g.f.’s we sometimes find that $M_x(\theta)$ does not exist, but the characteristic function always exists, since $|e^{i\theta x}|=1$.

We see that

$$C(0)=1.$$

The nth moment about the origin is the coefficient of $\dfrac{(i\theta)^n}{n!}$ in the expansion of $C(\theta)$.

Also

$$\mu_n'=\frac{1}{i^n}\left\{\frac{d^n}{d\theta^n}C(\theta)\right\}_{\theta=0},$$

provided $\mu_n'$ exists.

9.13. Cauchy Distribution. Consider the continuous distribution given by the p.d.f.

$$f(x)=\frac{1}{\pi}\cdot\frac{1}{1+x^2},\qquad-\infty<x<\infty.$$

It can be easily verified that

$$\int_{-\infty}^{\infty}f(x)\,dx=1.$$

Now

$$\mu_n'=\frac{1}{\pi}\int_{-\infty}^{\infty}\frac{x^n}{1+x^2}\,dx$$

has no meaning when $n>0$ and hence the mean and other moments do not exist. However, the characteristic function

$$C(\theta)=\frac{1}{\pi}\int_{-\infty}^{\infty}\frac{e^{i\theta x}}{1+x^2}\,dx$$

$$=\frac{1}{\pi}\int_{-\infty}^{\infty}\frac{\cos\theta x+i\sin\theta x}{1+x^2}\,dx$$

$$=\frac{2}{\pi}\int_0^{\infty}\frac{\cos\theta x}{1+x^2}\,dx.$$

It can be shown that this integral reduces to $e^{-\theta}$ when $\theta>0$ and $e^{\theta}$ when $\theta<0$. Hence the c.f. of the Cauchy Distribution is

$$C(\theta)=e^{-|\theta|}.$$

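Example. Show that the characteristic function of the Laplace
distribution

\[ f(x) = \tfrac{1}{2} e^{-|x|}, \quad -\infty < x < \infty \]

is

\[ C(\theta) = \frac{1}{1 + \theta^2}. \]

Find also the mean, the variance and the mean deviation about
the mean.

We have
\[ \int_{-\infty}^{\infty} f(x) \, dx = \tfrac{1}{2} \int_{-\infty}^{\infty} e^{-|x|} \, dx
   = \int_{0}^{\infty} e^{-x} \, dx = 1. \]

Also
\[ C(\theta) = \tfrac{1}{2} \int_{-\infty}^{\infty} e^{i\theta x} e^{-|x|} \, dx
   = \tfrac{1}{2} \int_{-\infty}^{0} e^{x(1 + i\theta)} \, dx + \tfrac{1}{2} \int_{0}^{\infty} e^{-x(1 - i\theta)} \, dx \]
\[ = \tfrac{1}{2} \left[ \frac{e^{x(1 + i\theta)}}{1 + i\theta} \right]_{-\infty}^{0}
   - \tfrac{1}{2} \left[ \frac{e^{-x(1 - i\theta)}}{1 - i\theta} \right]_{0}^{\infty}
   = \tfrac{1}{2} \left[ \frac{1}{1 + i\theta} + \frac{1}{1 - i\theta} \right]
   = \frac{1}{1 + \theta^2} \]

or
\[ C(\theta) = 1 - \theta^2 + \theta^4 - \theta^6 + \dots
            = 1 + (i\theta)^2 + (i\theta)^4 + (i\theta)^6 + \dots \]

Since the characteristic function involves only even powers of
\( \theta \), all the moments of odd order vanish. We have

\( \mu_1' = 0 \), \( \mu_2' \) = coefficient of \( \frac{(i\theta)^2}{2!} \) in \( C(\theta) \) = 2.

Hence the variance = 2,

the S. D. = \( \sqrt{2} \).

The mean deviation about the mean 0
\[ = \int_{-\infty}^{\infty} |x| f(x) \, dx
   = \tfrac{1}{2} \int_{-\infty}^{0} (-x) e^{x} \, dx + \tfrac{1}{2} \int_{0}^{\infty} x e^{-x} \, dx \]
\[ = \tfrac{1}{2} \left[ -x e^{x} + e^{x} \right]_{-\infty}^{0} + \tfrac{1}{2} \left[ -x e^{-x} - e^{-x} \right]_{0}^{\infty}
   = \tfrac{1}{2} (0 + 1 - 0 + 1) = 1. \]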
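
As a numerical cross-check of the Laplace example, the characteristic function can be approximated by quadrature. The sketch below is an illustration under stated assumptions: by symmetry the imaginary part vanishes, so C(θ) is taken as the integral of cos(θx) e^(-x) over (0, ∞), truncated at x = 40 (an arbitrary cut-off) and evaluated by the midpoint rule. The function name and step count are hypothetical choices.

```python
import math

def laplace_cf(theta, upper=40.0, steps=200_000):
    """Midpoint-rule estimate of the integral of cos(theta*x) * exp(-x) over (0, upper)."""
    h = upper / steps
    total = 0.0
    for j in range(steps):
        x = (j + 0.5) * h  # midpoint of the j-th sub-interval
        total += math.cos(theta * x) * math.exp(-x)
    return total * h

for theta in (0.0, 0.5, 1.0, 2.0):
    # Compare the numerical value with the closed form 1 / (1 + theta^2).
    print(theta, laplace_cf(theta), 1.0 / (1.0 + theta * theta))
```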




CHAPTER X 

METHOD OF LEAST SQUARES AND CURVE FITTING 

10·1. Having collected some data, it is desirable to find out
the form of the universe of which the observed values are regarded
as a sample. In other words, we try to find a functional relationship
between the observed values so as to have a clearer picture
of the universe of which our observations are a part. It is
neither necessary nor possible that all the observed values should
strictly satisfy this relationship, but the curve representing this
relationship should as far as possible pass close to all the points.
The difference between an observed value and the corresponding expected value
is known as the residual and the task is to minimize these residuals.
Since these differences may be positive in some cases and
negative in others, it is more convenient to make the sum of the
squares of these residuals a minimum. This is known as the
method of least squares.

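Suppose it is desired to fit a pth degree curve

\[ y = a + bx + cx^2 + \dots + kx^p \qquad \dots(1) \]

to the given values \( (x_i, y_i) \), i = 1, 2, 3, ..., m. The curve has
p + 1 unknown constants and hence if m = p + 1, we get p + 1
equations on substituting the values of \( (x_i, y_i) \) in the equation (1)
and a unique solution for the values of a, b, c, ..., k is possible.
However, if m > p + 1, no unique solution is possible and we
use the method of least squares. Now let

\[ y_i' = a + bx_i + cx_i^2 + \dots + kx_i^p \]

be the expected value of y for \( x = x_i \), and let the observed value of y for \( x_i \) be \( y_i \). Hence if \( u_i \) is the residual
for this point,

\[ u_i = y_i - y_i' = y_i - a - bx_i - cx_i^2 - \dots - kx_i^p. \qquad \dots(2) \]

To make the sum of the squares minimum, we have to
minimize

\[ S = \sum_{i=1}^{m} u_i^2 = \sum_{i=1}^{m} \left( y_i - a - bx_i - cx_i^2 - \dots - kx_i^p \right)^2. \qquad \dots(3) \]

By differential calculus, S will have its extremum values when

\[ \frac{\partial S}{\partial a} = 0, \quad \frac{\partial S}{\partial b} = 0, \quad \dots, \quad \frac{\partial S}{\partial k} = 0, \qquad \dots(4) \]

which give us p + 1 equations:

\[ \sum y_i = ma + b \sum x_i + c \sum x_i^2 + \dots + k \sum x_i^p, \]
\[ \sum x_i y_i = a \sum x_i + b \sum x_i^2 + \dots + k \sum x_i^{p+1}, \]
\[ \sum x_i^2 y_i = a \sum x_i^2 + b \sum x_i^3 + \dots + k \sum x_i^{p+2}, \]
\[ \dots \]
\[ \sum x_i^p y_i = a \sum x_i^p + b \sum x_i^{p+1} + \dots + k \sum x_i^{2p}. \qquad \dots(5) \]

(Agra M. Sc. ’56)

These are known as Normal Equations and can be solved as
simultaneous equations to give the values of the p + 1 constants
a, b, c, ..., k.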

This method does not help us to choose the degree of the 
curve to be fitted but determines the values of the constants 
when the form of the curve has already been chosen which is a 
matter of experience and practical consideration. 
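
The normal equations of § 10·1 can be formed and solved mechanically once the degree p has been chosen. The following is a minimal sketch, not a definitive implementation: the helper names `solve_linear` and `fit_polynomial` are hypothetical, the linear system is solved by ordinary Gaussian elimination, and the sample data are those of the second-degree parabola example in § 10·3.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back-substitution
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def fit_polynomial(xs, ys, p):
    """Return [a, b, c, ..., k] for the least-squares curve y = a + bx + ... + kx^p."""
    # Row r of the normal equations: sum(x^r * y) = sum over s of coeff_s * sum(x^(r+s)).
    A = [[sum(x ** (r + s) for x in xs) for s in range(p + 1)] for r in range(p + 1)]
    rhs = [sum((x ** r) * y for x, y in zip(xs, ys)) for r in range(p + 1)]
    return solve_linear(A, rhs)

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2, 6, 7, 8, 10, 11, 11, 10, 9]
coeffs = fit_polynomial(xs, ys, 2)
print([round(c, 3) for c in coeffs])
```

Working with the raw x values, as here, gives larger sums than the change of origin used in hand computation, but the fitted curve is the same.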

10·2. Some Special Curves. Here below we consider the
fitting of some curves :

Straight line. Let the equation of the straight line to be fitted
be y = a + bx.

The normal equations are

\[ \sum y = ma + b \sum x, \]
\[ \sum xy = a \sum x + b \sum x^2. \qquad \dots(6) \]

Solving these two equations, we get the values of a and b.

In particular, if the straight line to be fitted passes through
the origin, e.g. when the mean \( (\bar{x}, \bar{y}) \) of the observed values of x and y
is taken as the origin, the equation of the straight line to be
fitted will be

\[ y = bx. \]

The normal equation is

\[ \sum xy = b \sum x^2, \]

so that

\[ b = \frac{\sum xy}{\sum x^2}. \qquad \dots(7) \]


Parabola. Let the equation to be fitted be

\[ y = a + bx + cx^2. \]

The normal equations are

\[ \sum y = ma + b \sum x + c \sum x^2, \]
\[ \sum xy = a \sum x + b \sum x^2 + c \sum x^3, \]
\[ \sum x^2 y = a \sum x^2 + b \sum x^3 + c \sum x^4, \qquad \dots(8) \]

which can be solved for a, b, c. (Agra B. Sc. ’63)

Exponential curve. \( y = ae^{bx} \).

This equation can be reduced to the linear form on taking
logarithms. (Agra B. Sc. ’63)

\[ \log_{10} y = \log_{10} a + bx \log_{10} e. \qquad \dots(9) \]

The curve is of the form Y = AX + B, where \( Y = \log_{10} y \), X = x,
\( A = b \log_{10} e \), \( B = \log_{10} a \).

Applying the method given for a straight line, we can find
the values of A and B and thus a and b can be evaluated.
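
The reduction of the exponential curve to a straight line can be sketched in a few lines. This is an illustration only: `fit_line` and `fit_exponential` are hypothetical helper names, and the three data points are those of the exponential-curve exercise at the end of this chapter.

```python
import math

def fit_line(xs, ys):
    """Least-squares straight line y = a + b*x from the two normal equations; returns (a, b)."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    a = (sy - b * sx) / m
    return a, b

def fit_exponential(xs, ys):
    """Fit y = a * e^(b*x) by fitting Y = B + A*x with Y = log10(y)."""
    B, A = fit_line(xs, [math.log10(y) for y in ys])
    a = 10 ** B                  # since B = log10(a)
    b = A / math.log10(math.e)   # since A = b * log10(e)
    return a, b

a, b = fit_exponential([0, 2, 4], [5.012, 10, 31.62])
print(round(a, 3), round(b, 3))
```

Note that this minimises the squared residuals of log y rather than of y itself, which is what the logarithmic reduction described above implies.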

10·3. Solved Examples.

1. Fit a straight line to the following points :

    x :   0     1     2     3     4
    y :   1    1.8   3.3   4.5   6.3

(Agra M. Sc. ’49)

Since the number of values given is odd, we take the
middle value, i.e. the third, as our assumed mean, i.e. we put
u = x - 2, so that

                                           Total
    u   :  -2    -1     0     1     2        0
    y   :   1    1.8   3.3   4.5   6.3     16.9
    uy  :  -2   -1.8    0    4.5  12.6     13.3
    u²  :   4     1     0     1     4       10

The normal equations for the line y = a + bu are

    Σy = ma + bΣu     or  16.9 = 5a,
    Σuy = aΣu + bΣu²  or  13.3 = 10b,

giving a = 3.38, b = 1.33,

so that the equation is  y = 3.38 + 1.33u
                           = 3.38 + 1.33 (x - 2)
or                       y = 0.72 + 1.33x.

2. Fit a second degree parabola to the following data taking
x as the independent variable :

    x :  1   2   3   4   5    6    7    8   9
    y :  2   6   7   8   10   11   11   10  9

(Agra M. Sc. ’53, Delhi B. Com. ’54)

Here the number of values given is 9, an odd quantity. So
we take the middle value, i.e. the 5th, as our assumed mean. Hence we
introduce two new variables u = x - 5 and v = y - 7.

Let the second degree parabola to be fitted be v = a + bu + cu².

                                                            Total
    u    :   -4   -3   -2   -1    0    1    2    3    4        0
    v    :   -5   -1    0    1    3    4    4    3    2       11
    uv   :   20    3    0   -1    0    4    8    9    8       51
    u²   :   16    9    4    1    0    1    4    9   16       60
    u²v  :  -80   -9    0    1    0    4   16   27   32       -9
    u³   :  -64  -27   -8   -1    0    1    8   27   64        0
    u⁴   :  256   81   16    1    0    1   16   81  256      708

The normal equations according to equations (8) of § 10·2 are

    11 = 9a + 0 + 60c,
    51 = 0 + 60b + 0,
    -9 = 60a + 0 + 708c.

Solving these equations, we get

    a = 3, b = 0.85, c = -0.27.

Hence the parabola is

    v = 3 + 0.85u - 0.27u²
or  y - 7 = 3 + 0.85 (x - 5) - 0.27 (x - 5)²
or  y = -1 + 3.55x - 0.27x².

Note. When the values of x are equidistant, the calculations
can be simplified by taking the common distance as the unit of
measurement and the middle value as the assumed origin if the
number of given values is odd; in case it is even, we can choose
the origin to be at the middle point of the two middle values and
half the common difference is taken as the unit of measurement.

3. Show that the line of fit to the following data is given by
y = 0.7x + 11.28 :

    x :  0    5    10   15   20   25
    y :  12   15   17   22   24   30

(Agra B. Sc. ’63)

There are six values and the common difference in the values
of x is 5. Hence we take the mean of 10 and 15, i.e. 12.5, as the
origin and 5/2 as the unit of measurement. Putting

    u = (x - 12.5) / 2.5,   v = y - 20,

we get

                                              Total
    u   :  -5   -3   -1    1    3    5          0
    v   :  -8   -5   -3    2    4   10          0
    uv  :  40   15    3    2   12   50        122
    u²  :  25    9    1    1    9   25         70

The normal equations are

    0 = 6a + 0.b,
    122 = 0.a + 70b,

giving  a = 0, b = 1.743.

The line is  v = 1.743u
or      y - 20 = 1.743 × (x - 12.5) / 2.5
               = 0.7x - 8.715
or           y = 0.7x + 11.285.




4. Write the normal equations for fitting the curve \( pv^m = k \),
where k is a constant, p and v are the pressure and volume of a gas.
Fit this curve to the following set of observations taking p to be
the independent variable :

    p (kg./cm.²) :  0.5    1.0    1.5   2.0   2.5   3.0
    v (litres)   :  1620   1000   750   620   520   460

(Agra B. Sc. ’60)

\( pv^m = k \)  or  \( \log_{10} p + m \log_{10} v = \log_{10} k \)
or  \( \log_{10} v = \frac{1}{m} \log_{10} k - \frac{1}{m} \log_{10} p \).

This is of the form y = a + bx, where

\( y = \log_{10} v \), \( a = \frac{1}{m} \log_{10} k \), \( b = -\frac{1}{m} \), \( x = \log_{10} p \).

The normal equations are

    Σy = na + bΣx      or  \( \sum \log_{10} v = 6 \cdot \frac{1}{m} \log_{10} k - \frac{1}{m} \sum \log_{10} p \),
    Σxy = aΣx + bΣx²   or  \( \sum \log_{10} v \times \log_{10} p = \frac{1}{m} \log_{10} k \sum \log_{10} p - \frac{1}{m} \sum (\log_{10} p)^2 \).

                                                                              Total
    x (log₁₀ p)           -0.3010  0.0000  0.1761  0.3010  0.3979  0.4771    1.0511
    y (log₁₀ v)            3.2095  3.0000  2.8751  2.7924  2.7160  2.6628   17.2558
    xy (log₁₀ p)(log₁₀ v) -0.9661  0.0000  0.5063  0.8405  1.0807  1.2704    2.7318
    x² (log₁₀ p)²          0.09061 0.00000 0.03100 0.09061 0.15830 0.22760   0.59812

(It may be noted that \( \bar{1}.6990 = -1 + 0.6990 = -0.3010 \).)

Substituting these values of Σx, Σy, Σxy, Σx² in the normal
equations, we get

    17.2558 = (6/m) log₁₀ k - (1/m) × 1.0511,
    2.7318 = (1/m) log₁₀ k × 1.0511 - (1/m) × 0.59812

or  6 log₁₀ k - 17.2558m = 1.0511,
    1.0511 log₁₀ k - 2.7318m = 0.59812.

Solving these simultaneous equations, we get

    m = 1.42, log₁₀ k = 4.265.

On taking anti-log., we get k = 18400 nearly.

Hence the required equation is \( pv^{1.42} = 18400 \).
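
The fit of the gas law can be reproduced by the same logarithmic reduction. The sketch below is offered as a check on the arithmetic rather than as the book's own working: `fit_line` is a hypothetical helper solving the two normal equations of the straight line, and full-precision logarithms are used, so the last digits may differ slightly from four-figure table values.

```python
import math

def fit_line(xs, ys):
    """Least-squares straight line y = a + b*x; returns (a, b)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

p = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]      # pressure, kg./cm.^2
v = [1620, 1000, 750, 620, 520, 460]    # volume, litres

# log10 v = (1/m) log10 k - (1/m) log10 p, i.e. y = a + b*x with b = -1/m.
a, b = fit_line([math.log10(t) for t in p], [math.log10(t) for t in v])
m = -1.0 / b
k = 10 ** (m * a)   # because a = (1/m) log10 k
print(round(m, 3), round(k))
```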

5. Derive the least square equations for fitting a curve of the
type (i) \( y = ax + \frac{b}{x} \), (ii) \( y = ax^b \) to a set of n points.

(Delhi B. A. Hons. ’58)

(i) \( y = ax + \frac{b}{x} \).

Let the n points be \( (x_i, y_i) \), i = 1, 2, ..., n.

The residual for the ith point is \( y_i - ax_i - \frac{b}{x_i} \). Hence the sum
of the squares is

\[ S = \sum_{i=1}^{n} \left( y_i - ax_i - \frac{b}{x_i} \right)^2. \]

For extremum values,

\[ \frac{\partial S}{\partial a} = 0, \text{ giving } \sum xy = a \sum x^2 + nb, \qquad \dots(1) \]
\[ \frac{\partial S}{\partial b} = 0, \text{ giving } \sum \frac{y}{x} = na + b \sum \frac{1}{x^2}. \qquad \dots(2) \]

(1) and (2) are the normal equations.

(ii) \( y = ax^b \).

\[ \log_{10} y = \log_{10} a + b \log_{10} x. \]

This is of the form Y = A + BX,
where \( Y = \log_{10} y \), \( A = \log_{10} a \), B = b and \( X = \log_{10} x \),

which can be treated in the manner already discussed for the case
of fitting a straight line.

6. Form normal equations and hence find the most plausible values
of x and y from the following equations :

    x + y = 3.01,  2x - y = 0.03,  x + 3y = 7.03,  3x + y = 4.97.

(Punjab B. A. ’53, Agra M. Sc. ’61)

The residuals are

    x + y - 3.01,  2x - y - 0.03,  x + 3y - 7.03,  3x + y - 4.97.

The sum of the squares

    S = (x + y - 3.01)² + (2x - y - 0.03)² + (x + 3y - 7.03)² + (3x + y - 4.97)².

For extremum values \( \frac{\partial S}{\partial x} = 0 \), \( \frac{\partial S}{\partial y} = 0 \). We get the normal
equations as

    (x + y - 3.01) + 2 (2x - y - 0.03) + (x + 3y - 7.03) + 3 (3x + y - 4.97) = 0,
    (x + y - 3.01) - (2x - y - 0.03) + 3 (x + 3y - 7.03) + (3x + y - 4.97) = 0

or  15x + 5y - 25.01 = 0,
    5x + 12y - 29.04 = 0.

Solving,

    x / (-145.20 + 300.12) = y / (-125.05 + 435.60) = 1 / (180 - 25)

or  x / 154.92 = y / 310.55 = 1 / 155,

giving  x = 0.999, y = 2.004.
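
For a set of linear equations in two unknowns, the normal equations are simply the products AᵀA and Aᵀb of the coefficient matrix A and the right-hand sides b. The sketch below illustrates this for the four equations of the example above; `normal_equations` is a hypothetical helper name, and the resulting 2 × 2 system is solved by cross-multiplication (Cramer's rule).

```python
def normal_equations(A, b):
    """Return (AtA, Atb): the coefficients and right-hand sides of the normal equations."""
    cols = len(A[0])
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(cols)] for i in range(cols)]
    Atb = [sum(row[i] * rhs for row, rhs in zip(A, b)) for i in range(cols)]
    return AtA, Atb

# Coefficients of x and y, and right-hand sides, of the four observation equations.
A = [[1, 1], [2, -1], [1, 3], [3, 1]]
b = [3.01, 0.03, 7.03, 4.97]

AtA, Atb = normal_equations(A, b)

# Solve the two normal equations by cross-multiplication.
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x = (Atb[0] * AtA[1][1] - AtA[0][1] * Atb[1]) / det
y = (AtA[0][0] * Atb[1] - AtA[1][0] * Atb[0]) / det
print(AtA, [round(t, 2) for t in Atb], round(x, 4), round(y, 4))
```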

Exercises 


1. Fit a straight line to the following data treating y as the
dependent variable :

    x :  1   2   3   4    5
    y :  5   7   9   10   11.

[Ans. y = 3.9 + 1.5x]

2. Fit a straight line to the following data regarding y as the
dependent variable :

    x :  0    5    10   15   20
    y :  12   15   17   22   24.

[Ans. y = 0.62x + 11.8]

3. Show that the line of fit to the following data is given by
y = -0.5x + 8.

    x :  6   7   7   8   8   8   9   9   10
    y :  5   5   4   5   4   3   4   3   3.

4. Fit a straight line to the following data, showing the production
of a commodity in different years in Punjab.

    Years x :                   1911   1912   1913   1914   1915
    Production y (1000 tons) :  10     12     8      10     14.

Hint. Take 1913 as origin for the x series. [Ans. y = 0.6x + 10.8]

5. The weights of a calf taken at weekly intervals are given
below. Fit a straight line using the method of least squares,
and calculate the average rate of growth per week.

    Age x :     1     2     3     4     5     6     7     8     9      10
    Weight y :  52.5  58.7  65.0  70.2  75.4  81.1  87.2  95.5  102.2  108.4.

(Delhi B. A. Hons. ’56, Madras B. A. ’45)
[Ans. y = 45.74 + 6.16x. Average rate of growth per week is
6.16 units]

6. Fit a parabola of the second degree to the following data :

    x :  1.0   1.5   2.0   2.5   3.0   3.5   4.0
    y :  1.1   1.3   1.6   2.6   2.7   3.4   4.1.

[Ans. y = 1.04 - 0.20x + 0.24x²]


7. Fit a parabola of the second degree to the following data
taking x as the independent variable :

    x :  0   1     2     3     4
    y :  1   1.8   1.3   2.5   6.3.

Find out the difference between the actual value of y and
the value of y obtained from the fitted curve when x = 2.

(Agra M. Sc. ’56, Lucknow B. Sc. ’52)
[Ans. y = 1.48 + 1.13 (x - 2) + 0.55 (x - 2)².
Difference = -0.18]

8. Fit the curve \( y = ae^{bx} \) to the following data, e being the Napierian
base, 2.71828 :

    x :  0       2    4
    y :  5.012   10   31.62.

[Ans. \( y = 4.642 \, e^{0.46x} \)]

9. Use the method of least squares to determine a and b in the
formula \( y = ax^2 + \frac{b}{x} \) for the following data :

    x :  1       2      3      4
    y :  -1.51   0.99   3.88   7.66.

[Ans. a = 0.509, b = -2.06]

10. Find the most plausible values of x, y and z from the four
equations :

(i) x - y + 2z = 3,  3x + 2y - 5z = 5,  4x + y + 4z = 21,
    -x + 3y + 3z = 14. (Punjab B. A. ’56)

(ii) x + 2y + z = 1,  2x + y + z = 4,  -x + y + 2z = 3,
    4x + 2y - 5z = -7. (Punjab M. A. ’45)

[Ans. (i) x = 2.47, y = 3.55, z = 1.92,
(ii) x = 1.16, y = -0.76, z = 2.08]



CHAPTER XI 


BIVARIATE DISTRIBUTION, REGRESSION 

AND CORRELATION 

11·1. Introduction. So far we have dealt with one variable,
e.g. the distribution of people according to heights or weights,
annual rainfall in a certain area, the agricultural yield and so on.
Now if we measure both heights and weights so as to find some
sort of relation between the two (whether tall persons are heavier
than short ones, whether an increase in rainfall results in greater
agricultural yield, whether more consumption of rice results in increase of
birth rate, and so on), we may find that a change in one variable
results in a direct or inverse change in the other or does not have
any effect on the second variable. The relationship between two
variables such that a change in one variable results in a positive
or negative change in the other, and also a greater change in one
variable results in a correspondingly greater change in the other, is
known as correlation. If the second variable is unaffected by a
change in the first, e.g. the heights of the fathers and marks in
mathematics of the sons, they are said to be statistically independent.

11·2. Scatter or Dot Diagram. Suppose we measure the
heights and weights of a certain number of people ; denote the
quantities by x and y and plot them on a graph paper referred to
two perpendicular axes. For each set of observations on one
person, there shall be one point and thus we get a scatter diagram
as in figure 1 (a).

[Fig. 1 (a) and Fig. 1 (b) : scatter diagrams]

If the origin of axes is taken at \( (\bar{x}, \bar{y}) \), where \( \bar{x}, \bar{y} \) are the
means of the values of x and y respectively, the points may be
scattered all round the origin. For points lying in the I and III
quadrants, the product \( (x - \bar{x})(y - \bar{y}) \) is positive and for those
in the II and IV quadrants, it is negative. A cluster of points in
the first and third quadrants indicates that there is a positive
correlation : large positive or negative values of x correspond to
large positive or negative values of y respectively. Similarly a cluster in the second
and fourth quadrants indicates negative correlation : large positive
or negative values of x correspond to large negative or positive
values of y respectively. Thus \( \sum (x_i - \bar{x})(y_i - \bar{y}) \) gives a measure
of correlation between x and y.


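Karl Pearson’s Coefficient of Correlation. Karl Pearson
(1857—1936) defined a coefficient of correlation by

\[ r = \frac{\frac{1}{N} \sum (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} \qquad \dots(1) \]

\[ = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[ \sum (x - \bar{x})^2 \cdot \sum (y - \bar{y})^2 \right]}} \]

\[ = \frac{\text{co-variance } (x, y)}{\{(\text{variance of } x)(\text{variance of } y)\}^{1/2}} \]

\[ = \frac{\frac{1}{N} \left( \sum x_i y_i - \bar{x} \sum y_i - \bar{y} \sum x_i + N \bar{x} \bar{y} \right)}
          {\left[ \frac{1}{N} \sum x_i^2 - \bar{x}^2 \right]^{1/2} \left[ \frac{1}{N} \sum y_i^2 - \bar{y}^2 \right]^{1/2}} \]

\[ = \frac{\sum x_i y_i - N \bar{x} \bar{y}}
          {\left[ \sum x_i^2 - N \bar{x}^2 \right]^{1/2} \left[ \sum y_i^2 - N \bar{y}^2 \right]^{1/2}}, \qquad \dots(2) \]

since \( \sum x_i = N \bar{x} \), \( \sum y_i = N \bar{y} \).

Note that r has no units and is a mere number.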

11·3. The value of r is independent of the origin of reference
and the scale of reference. (B. A. Hons. Delhi)

Proof. Let \( u = \frac{x - x_0}{h} \), \( v = \frac{y - y_0}{k} \), where h and k are positive ; then

\[ x = uh + x_0, \quad y = vk + y_0. \]

We get \( \bar{x} = \bar{u}h + x_0 \), \( \bar{y} = \bar{v}k + y_0 \) [(5) of § 3·4, p. 36],
so that \( x - \bar{x} = h(u - \bar{u}) \), \( y - \bar{y} = k(v - \bar{v}) \), and hence
\( \sigma_x = h\sigma_u \), \( \sigma_y = k\sigma_v \).

Substituting these values in (1), we get

\[ r = \frac{\frac{1}{N} \sum (u - \bar{u})(v - \bar{v})}{\sigma_u \sigma_v}. \]

This result is very helpful in the calculation of the coefficient
of correlation since we may take any convenient origin for
x and y and also change the scale by dividing by the class interval
and the formula applicable shall remain unchanged. This fact
has been used in the example solved below.

Positive or negative values of r indicate positive or negative
correlation, while if r = 0, the variates are uncorrelated.

11·4. Stereograms and Correlation Surface. For grouped
bivariate data, we take two mutually perpendicular axes in a horizontal
plane, the units of measurement being the class intervals
of the variates. We thus have a network of class rectangles. On
each rectangle we erect a vertical prism with height in proportion
to the frequency of that class. We thus get a prismogram or
stereogram in three dimensions corresponding to the histogram in two
dimensions. If the class intervals become smaller and smaller,
the number of observations increasing infinitely, the stereogram
tends to a smooth continuous surface known as the correlation
surface. For a grouped distribution,

\[ r = \frac{N \sum f x y - (\sum f x)(\sum f y)}
            {\sqrt{\left[ \{ N \sum f x^2 - (\sum f x)^2 \} \{ N \sum f y^2 - (\sum f y)^2 \} \right]}} \]

\[ = \frac{N \sum f u v - (\sum f u)(\sum f v)}
          {\sqrt{\left[ \{ N \sum f u^2 - (\sum f u)^2 \} \{ N \sum f v^2 - (\sum f v)^2 \} \right]}} \]

for the step deviation method,

where \( u = \frac{x - x_0}{h} \), \( v = \frac{y - y_0}{k} \).

Example. Calculate the coefficient of correlation between the
values of x and y given below :

    x :  78    89    97    69    59    79    68    61
    y :  125   137   156   112   107   136   123   108

(You may use 69 as working mean for x and 112 for y.)

(Delhi M. A. ’53)

      x     Deviation      ξ²       y     Deviation      η²       ξη
            from 69 (ξ)                   from 112 (η)
      78        9          81      125       13         169      117
      89       20         400      137       25         625      500
      97       28         784      156       44        1936     1232
      69        0           0      112        0           0        0
      59      -10         100      107       -5          25       50
      79       10         100      136       24         576      240
      68       -1           1      123       11         121      -11
      61       -8          64      108       -4          16       32
    Total      48        1530                108        3468     2160

N = 8.

\[ r = \frac{N \sum \xi\eta - (\sum \xi)(\sum \eta)}
            {\sqrt{\left[ \{ N \sum \xi^2 - (\sum \xi)^2 \} \{ N \sum \eta^2 - (\sum \eta)^2 \} \right]}} \]

\[ = \frac{8 \times 2160 - (48)(108)}
          {\sqrt{\left[ \{ 8 \times 1530 - (48)^2 \} \{ 8 \times 3468 - (108)^2 \} \right]}} \]

\[ = \frac{17280 - 5184}{\sqrt{(9936 \times 16080)}} = 0.957 \text{ nearly.} \]


11·5. Probable Error of coefficient of correlation. Since we
calculate the value of r from a sample, it is subject to error of
sampling. The probable error is that amount such that, for a random
sample, the probability that the value of r lies within
r ± (Probable Error) is exactly ½, provided r is distributed normally.
According to Secrist, the probable error of the correlation coefficient
is an amount which if added to and subtracted from the
average correlation coefficient produces amounts within which the
chances are even that a coefficient of correlation from a series
selected at random will fall.

The formula for probable error is

\[ \text{P. E.} = 0.6745 \, \frac{1 - r^2}{\sqrt{N}}. \]

The following rules are observed for P. E. :

(i) If r < P. E., there is no evidence of correlation.

(ii) If r > 6 P. E., the presence of correlation is strongly
suggested.

It will be proved later on that \( -1 \le r \le 1 \), so that P. E. is
never negative.


11·6. Solved Examples.

1. Calculate r between the ages of husbands and wives from
the following data and find its probable error.

    Age of               Age of husbands
    wives      20—30   30—40   40—50   50—60   60—70    Total
    15—25        5       9       3       0       0        17
    25—35        0      10      25       2       0        37
    35—45        0       1      12       2       0        15
    45—55        0       0       4      16       5        25
    55—65        0       0       0       4       2         6
    Total        5      20      44      24       7       100

(U. P. P. C. S. ’38, ’60)

    Age of            Age of husbands (y)
    wives (x)  20—30  30—40  40—50  50—60  60—70     f     u     fu    fu²    Σfv   uΣfv
    15—25        5      9      3      0      0      17    -1    -17    17     -2      2
    25—35        0     10     25      2      0      37     0      0     0     29      0
    35—45        0      1     12      2      0      15     1     15    15     16     16
    45—55        0      0      4     16      5      25     2     50   100     51    102
    55—65        0      0      0      4      2       6     3     18    54     14     42
    Total f      5     20     44     24      7     100           66   186    108    162
    v           -1      0      1      2      3
    fv          -5      0     44     48     21     108
    fv²          5      0     44     96     63     208
    Σfu         -5     -8     17     46     16      66
    vΣfu         5      0     17     92     48     162

Method. We assume the origin of x and y to be at the mid-values
of the 25—35 and 30—40 classes respectively and put

    u = (x' - 30) / 10,   v = (y' - 35) / 10,

where x' and y' are the middle values of the x and y class
intervals respectively and 10 is the width of the class interval in
both x and y. Note that it is not necessary that the class
interval be the same for both variates, as this will
in no way affect the calculation. The working is done
as shown in the table, the entries in the fu column being
the product of u and f in each row, similarly those under the
fu² column being the product of u and fu in each row, and so on.
There are two checks on the correctness of the calculations : the totals
Σfu = 66 and Σfv = 108 are each obtained twice, once from the rows and once
from the columns, and the totals of the uΣfv column and the vΣfu row must agree
(both are 162).































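Calculation :

\[ r = \frac{N \sum fuv - (\sum fu)(\sum fv)}
            {\sqrt{\left[ \{ N \sum fu^2 - (\sum fu)^2 \} \{ N \sum fv^2 - (\sum fv)^2 \} \right]}}, \]

where \( N = \sum f \).

\[ \therefore \; r = \frac{100 \times 162 - 66 \times 108}
                         {\sqrt{\left[ \{ 100 \times 186 - (66)^2 \} \{ 100 \times 208 - (108)^2 \} \right]}}
   = \frac{9072}{\sqrt{(14244 \times 9136)}}
   = \frac{90.72}{114.08} \]

= 0.795 nearly.

\[ \text{Probable Error} = 0.6745 \, \frac{1 - r^2}{\sqrt{N}}
   = 0.6745 \times \frac{1 - (0.795)^2}{\sqrt{100}} \]

= 0.025 nearly.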
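
The step-deviation sums for a correlation table can be accumulated directly from the cell frequencies. The sketch below is offered as a check on the arithmetic of the husbands-and-wives example; the variable names are hypothetical, and the step deviations u and v are those stated in the Method (origins at the mid-values of the 25—35 and 30—40 classes, class width 10).

```python
import math

# Rows: age of wives 15-25, ..., 55-65; columns: age of husbands 20-30, ..., 60-70.
freq = [
    [5, 9, 3, 0, 0],
    [0, 10, 25, 2, 0],
    [0, 1, 12, 2, 0],
    [0, 0, 4, 16, 5],
    [0, 0, 0, 4, 2],
]
u = [-1, 0, 1, 2, 3]  # step deviations of the wives' classes
v = [-1, 0, 1, 2, 3]  # step deviations of the husbands' classes

rows, cols = range(len(u)), range(len(v))
N = sum(sum(row) for row in freq)
s_fu = sum(u[i] * freq[i][j] for i in rows for j in cols)
s_fv = sum(v[j] * freq[i][j] for i in rows for j in cols)
s_fu2 = sum(u[i] ** 2 * freq[i][j] for i in rows for j in cols)
s_fv2 = sum(v[j] ** 2 * freq[i][j] for i in rows for j in cols)
s_fuv = sum(u[i] * v[j] * freq[i][j] for i in rows for j in cols)

r = (N * s_fuv - s_fu * s_fv) / math.sqrt(
    (N * s_fu2 - s_fu ** 2) * (N * s_fv2 - s_fv ** 2)
)
probable_error = 0.6745 * (1 - r * r) / math.sqrt(N)
print(N, s_fu, s_fv, s_fu2, s_fv2, s_fuv, r, probable_error)
```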

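2. Show that if x', y' are the deviations of the variables
from their means,

\[ r = 1 - \frac{1}{2N} \sum \left( \frac{x'}{\sigma_x} - \frac{y'}{\sigma_y} \right)^2
     = -1 + \frac{1}{2N} \sum \left( \frac{x'}{\sigma_x} + \frac{y'}{\sigma_y} \right)^2. \]

Hence deduce that \( -1 \le r \le 1 \).

We have

\[ \frac{1}{2N} \sum \left( \frac{x'}{\sigma_x} \pm \frac{y'}{\sigma_y} \right)^2
   = \frac{1}{2N} \sum \left( \frac{x'^2}{\sigma_x^2} + \frac{y'^2}{\sigma_y^2} \pm \frac{2x'y'}{\sigma_x \sigma_y} \right)
   = \tfrac{1}{2} (1 + 1 \pm 2r)
   = 1 \pm r, \]

which are the required results.

Since the extreme left hand side is a sum of squares
of real quantities, it is never negative, so that \( 1 \pm r \ge 0 \). Hence

\[ -1 \le r \le 1. \]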

3. If u = ax + by + cz + ..., the constants a, b, c, ... being either
positive or negative, then show that

\[ \sigma_u^2 = a^2 \sigma_x^2 + b^2 \sigma_y^2 + c^2 \sigma_z^2 + \dots + 2ab \, r_{xy} \sigma_x \sigma_y + \dots \]

Now  u = ax + by + cz + ...,

\( \bar{u} = a\bar{x} + b\bar{y} + c\bar{z} + \dots \),

so that  u' = ax' + by' + cz' + ...,

the primes indicating the deviations from the respective means.
Squaring,  u'² = a²x'² + b²y'² + c²z'² + ... + 2abx'y' + ...

On taking expected values,

\[ E(u'^2) = a^2 E(x'^2) + b^2 E(y'^2) + \dots + 2ab \, E(x'y') + \dots \]
or
\[ \sigma_u^2 = a^2 \sigma_x^2 + b^2 \sigma_y^2 + c^2 \sigma_z^2 + \dots + 2ab \, r_{xy} \sigma_x \sigma_y + \dots \]


4. If a linear relation exists between the variables x and y,
then the coefficient of correlation between them is +1 or -1.

(Agra M. Sc. ’62, Aligarh S. ’60)

Let the relation be y = mx + c and the number of observations
be N ; then \( \bar{y} = m\bar{x} + c \), so that \( y - \bar{y} = m (x - \bar{x}) \).

\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[ \sum (x - \bar{x})^2 \cdot \sum (y - \bar{y})^2 \right]}} \]
or
\[ r = \frac{m \sum (x - \bar{x})^2}{\sqrt{\left[ m^2 \{ \sum (x - \bar{x})^2 \}^2 \right]}} = \pm 1. \]

Note that the sign of r is the same as that of the covariance, i.e. of
\( \sum (x - \bar{x})(y - \bar{y}) \), the denominator \( \sigma_x \sigma_y \) being always positive.
Hence in this case, r = 1 if m is positive and -1 if m is negative.

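5. If \( x_1, x_2 \) and \( x_3 \) be uncorrelated variables each having
the same standard deviation, obtain the coefficient of correlation
between \( x_1 + x_2 \) and \( x_2 + x_3 \). (I. A. S. ’49)

Let \( \xi = x_1 + x_2 \) and \( \eta = x_2 + x_3 \),

then \( \bar{\xi} = \bar{x}_1 + \bar{x}_2 \) and \( \bar{\eta} = \bar{x}_2 + \bar{x}_3 \),

\[ \xi - \bar{\xi} = (x_1 - \bar{x}_1) + (x_2 - \bar{x}_2) = x_1' + x_2', \]
\[ \eta - \bar{\eta} = (x_2 - \bar{x}_2) + (x_3 - \bar{x}_3) = x_2' + x_3', \]

where \( x_1', x_2', x_3' \) are the deviations of \( x_1, x_2, x_3 \) from their
means respectively.

\[ r = \frac{\sum (\xi - \bar{\xi})(\eta - \bar{\eta})}{\sqrt{\{ \sum (\xi - \bar{\xi})^2 \cdot \sum (\eta - \bar{\eta})^2 \}}}
     = \frac{\sum (x_1' + x_2')(x_2' + x_3')}{\sqrt{\{ \sum (x_1' + x_2')^2 \cdot \sum (x_2' + x_3')^2 \}}}
     = \frac{\sum x_2'^2}{\sqrt{\{ \sum (x_1'^2 + x_2'^2) \cdot \sum (x_2'^2 + x_3'^2) \}}}, \]

since \( x_1, x_2, x_3 \) are uncorrelated, the covariance between them
is zero, i.e. \( \sum x_1' x_2' = \sum x_2' x_3' = \sum x_1' x_3' = 0 \).

Hence
\[ r = \frac{N \sigma^2}{\sqrt{\{ N (\sigma^2 + \sigma^2) \cdot N (\sigma^2 + \sigma^2) \}}} = \frac{1}{2}, \]

since the S. D. of each of \( x_1, x_2, x_3 \) is the same.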


6. If x = au + b, y = cv + d, where a, b, c, d are constants,
show that the correlation coefficient between x and y is the same as
that between u and v. What is the practical utility of this result ?

(Andhra ’54)

See § 11·3.

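7. If x and y are two correlated variables with S. D. \( \sigma_x \) and \( \sigma_y \)
and correlation coefficient \( \rho \), find the correlation between ax + by
and cx + dy, a, b, c, d being constants. Find k for which \( x + ky \)
and \( x + \frac{\sigma_x}{\sigma_y} y \) are uncorrelated. (Madras ’56)

Let u = ax + by, v = cx + dy,

then \( \bar{u} = a\bar{x} + b\bar{y} \), \( \bar{v} = c\bar{x} + d\bar{y} \),

\[ u - \bar{u} = a (x - \bar{x}) + b (y - \bar{y}), \quad v - \bar{v} = c (x - \bar{x}) + d (y - \bar{y}), \]

\[ r_{uv} = \frac{E (u - \bar{u})(v - \bar{v})}{\sqrt{\{ E (u - \bar{u})^2 \cdot E (v - \bar{v})^2 \}}} \]

\[ = \frac{E [\{ a (x - \bar{x}) + b (y - \bar{y}) \} \{ c (x - \bar{x}) + d (y - \bar{y}) \}]}
          {\sqrt{[ E \{ a (x - \bar{x}) + b (y - \bar{y}) \}^2 \cdot E \{ c (x - \bar{x}) + d (y - \bar{y}) \}^2 ]}} \]

\[ = \frac{ac \, E (x - \bar{x})^2 + bd \, E (y - \bar{y})^2 + (ad + bc) \, E (x - \bar{x})(y - \bar{y})}
          {\sqrt{[ \{ a^2 E (x - \bar{x})^2 + b^2 E (y - \bar{y})^2 + 2ab \, E (x - \bar{x})(y - \bar{y}) \}
                   \{ c^2 E (x - \bar{x})^2 + d^2 E (y - \bar{y})^2 + 2cd \, E (x - \bar{x})(y - \bar{y}) \} ]}} \]

\[ = \frac{ac \, \sigma_x^2 + bd \, \sigma_y^2 + (ad + bc) \, \sigma_x \sigma_y \rho}
          {\sqrt{[ \{ a^2 \sigma_x^2 + b^2 \sigma_y^2 + 2ab \, \sigma_x \sigma_y \rho \}
                   \{ c^2 \sigma_x^2 + d^2 \sigma_y^2 + 2cd \, \sigma_x \sigma_y \rho \} ]}}. \]

If u and v are uncorrelated, \( r_{uv} = 0 \). The quantities given are
\( x + ky \) and \( x + \frac{\sigma_x}{\sigma_y} y \), so that in the numerator of the above expression
we substitute a = 1, b = k, c = 1, \( d = \frac{\sigma_x}{\sigma_y} \) and equate it to zero.

We get
\[ \sigma_x^2 + k \frac{\sigma_x}{\sigma_y} \sigma_y^2 + \left( \frac{\sigma_x}{\sigma_y} + k \right) \sigma_x \sigma_y \rho = 0 \]
or
\[ \sigma_x (\sigma_x + k \sigma_y)(1 + \rho) = 0, \]
giving
\[ k = -\frac{\sigma_x}{\sigma_y}. \]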



8. If x and y are two correlated variables with the same S. D.
and the correlation coefficient r, show that the correlation between
x and x + y is \( \sqrt{\frac{1 + r}{2}} \). (Madras ’56)

In the result of the last example put a = 1, b = 0, c = 1, d = 1,
\( \sigma_x = \sigma_y = \sigma \).

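9. If u = x + y, v = x - y and x and y are independent, show that

\[ \sigma_u^2 = \sigma_x^2 + \sigma_y^2 = \sigma_v^2. \]

In other words var (x ± y) = var (x) + var (y).

(B. A. Aligarh ’60)

Let  w = x ± y,

\( \bar{w} = \bar{x} \pm \bar{y} \),

\[ w - \bar{w} = (x - \bar{x}) \pm (y - \bar{y}), \]
\[ (w - \bar{w})^2 = (x - \bar{x})^2 + (y - \bar{y})^2 \pm 2 (x - \bar{x})(y - \bar{y}), \]
\[ E (w - \bar{w})^2 = E (x - \bar{x})^2 + E (y - \bar{y})^2 \pm 2 E (x - \bar{x})(y - \bar{y}), \]
\[ \sigma_w^2 = \sigma_x^2 + \sigma_y^2 \pm 2 \sigma_x \sigma_y r_{xy}. \qquad \dots(1) \]

Since x and y are independent, \( r_{xy} = 0 \),

so that \( \sigma_u^2 = \sigma_v^2 = \sigma_x^2 + \sigma_y^2 \).

Also when x, y are correlated, we get from (1),

\[ r_{xy} = \frac{\sigma_w^2 - \sigma_x^2 - \sigma_y^2}{2 \sigma_x \sigma_y}
   \quad \text{or} \quad \frac{\sigma_x^2 + \sigma_y^2 - \sigma_w^2}{2 \sigma_x \sigma_y} \qquad \dots(2) \]

(Madras ’40)

according as w = x + y or x - y respectively.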

10. Two variates x and y have zero means, the same variance
\( \sigma^2 \) and zero correlation ; show that

    x cos α + y sin α  and  x sin α - y cos α

have the same variance \( \sigma^2 \) and zero correlation.

Here \( \bar{x} = 0 \), \( \bar{y} = 0 \) and co-variance (x, y) = 0, since \( r_{xy} = 0 \).

Let u = x cos α + y sin α and v = x sin α - y cos α ;

then \( \bar{u} = \bar{x} \cos α + \bar{y} \sin α = 0 \) and \( \bar{v} = \bar{x} \sin α - \bar{y} \cos α = 0 \).

Now  u² = x² cos² α + y² sin² α + 2xy sin α cos α,
     v² = x² sin² α + y² cos² α - 2xy sin α cos α.

Taking expected values,

\[ E (u^2) = E (x^2) \cos^2 α + E (y^2) \sin^2 α + 2 E (xy) \sin α \cos α
           = \sigma^2 \cos^2 α + \sigma^2 \sin^2 α. \]

∴ Variance (u) = \( \sigma^2 \).

Similarly variance (v) = \( \sigma^2 \).

\[ r_{uv} = \frac{E (uv)}{\sigma_u \sigma_v}
          = \frac{E \{ (x \cos α + y \sin α)(x \sin α - y \cos α) \}}{\sigma^2} \]
\[ = \frac{\{ E (x^2) - E (y^2) \} \sin α \cos α + E (xy)(\sin^2 α - \cos^2 α)}{\sigma^2} \]

= 0, since \( E (x^2) = E (y^2) = \sigma^2 \) and E (xy) = 0.

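11. If u = ax + by and v = bx - ay, where x and y represent
deviations from the respective means, and if the coefficient of correlation
between x and y is \( \rho \) but u and v are uncorrelated, show that

\[ \sigma_u \sigma_v = (a^2 + b^2) \, \sigma_x \sigma_y \sqrt{(1 - \rho^2)}. \]

u = ax + by and v = bx - ay,

E (u) = aE (x) + bE (y) and E (v) = bE (x) - aE (y).

Since x and y are deviations from the means, E (x) and E (y)
are both zero and hence

\( E (u) = \bar{u} = 0 \), \( E (v) = \bar{v} = 0 \),

\[ E (u^2) = E (ax + by)^2 = a^2 E (x^2) + b^2 E (y^2) + 2ab \, E (xy) \]
or
\[ \sigma_u^2 = a^2 \sigma_x^2 + b^2 \sigma_y^2 + 2ab \rho \, \sigma_x \sigma_y. \]

Similarly
\[ \sigma_v^2 = b^2 \sigma_x^2 + a^2 \sigma_y^2 - 2ab \rho \, \sigma_x \sigma_y, \]

so that
\[ \sigma_u^2 \sigma_v^2 = (a^4 + b^4) \, \sigma_x^2 \sigma_y^2 + a^2 b^2 (\sigma_x^4 + \sigma_y^4)
   + 2ab \rho \, \sigma_x \sigma_y \{ (b^2 - a^2)(\sigma_x^2 - \sigma_y^2) \} - 4a^2 b^2 \rho^2 \sigma_x^2 \sigma_y^2 \]
\[ = (a^2 + b^2)^2 \sigma_x^2 \sigma_y^2 + a^2 b^2 (\sigma_x^2 - \sigma_y^2)^2
   + 2ab \rho \, \sigma_x \sigma_y \{ (b^2 - a^2)(\sigma_x^2 - \sigma_y^2) \} - 4a^2 b^2 \rho^2 \sigma_x^2 \sigma_y^2. \qquad \dots(1) \]

But since u and v are uncorrelated,

    E (uv) = 0
or  E {(ax + by)(bx - ay)} = 0
or  E {ab (x² - y²) + xy (b² - a²)} = 0
or  \( ab (\sigma_x^2 - \sigma_y^2) + \rho \, \sigma_x \sigma_y (b^2 - a^2) = 0 \)
or
\[ a^2 b^2 (\sigma_x^2 - \sigma_y^2)^2 + \rho^2 \sigma_x^2 \sigma_y^2 (b^2 - a^2)^2
   + 2ab \rho \, (\sigma_x^2 - \sigma_y^2)(b^2 - a^2) \, \sigma_x \sigma_y = 0. \qquad \dots(2) \]

Subtracting (2) from (1), we get

\[ \sigma_u^2 \sigma_v^2 = (a^2 + b^2)^2 \sigma_x^2 \sigma_y^2 - \rho^2 \sigma_x^2 \sigma_y^2 \{ (b^2 - a^2)^2 + 4a^2 b^2 \}
   = (a^2 + b^2)^2 \sigma_x^2 \sigma_y^2 (1 - \rho^2) \]
or
\[ \sigma_u \sigma_v = (a^2 + b^2) \, \sigma_x \sigma_y \sqrt{(1 - \rho^2)}. \]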

11·7. Rank Correlation. Suppose n individuals are graded
regarding some characteristic A, each being assigned a positive
integral number ; for example, n students are examined in
mathematics and the top scorer of marks is assigned number 1,
the second 2, the third 3 and so on, the one getting least marks
the number n. Similarly they may be graded regarding another
characteristic B, say marks in English, and similarly assigned
numbers. It is required to find a correlation coefficient
between their grades in the two characteristics. Or, we wish to
investigate the degree of agreement between two judges who have
graded or ranked the same individuals regarding the same characteristic,
for which there is no known objective standard of
measurement.

Let the ranks assigned regarding the characteristic A be \( x_i \),
where i = 1, 2, ..., n and \( x_i \) can take values between 1 and n,
no two values of x being equal, and similarly let the ranks regarding
the second characteristic be \( y_i \), i = 1, 2, 3, ..., n.

Now
\[ \bar{x} = \frac{1}{n} (1 + 2 + 3 + \dots + n) = \frac{n + 1}{2} = \bar{y}, \]

\[ \sigma_x^2 = \frac{1}{n} (1^2 + 2^2 + 3^2 + \dots + n^2) - \bar{x}^2
   = \frac{(n + 1)(2n + 1)}{6} - \left( \frac{n + 1}{2} \right)^2
   = \frac{n^2 - 1}{12} = \sigma_y^2. \]

(U. P. P. C. S. ’58)

Let
\[ d_i = x_i - y_i = (x_i - \bar{x}) - (y_i - \bar{y}), \]

\[ \frac{1}{n} \sum d_i^2 = \frac{1}{n} \sum (x_i - \bar{x})^2 + \frac{1}{n} \sum (y_i - \bar{y})^2 - \frac{2}{n} \sum (x_i - \bar{x})(y_i - \bar{y})
   = \sigma_x^2 + \sigma_y^2 - 2 \text{ co-variance } (x, y). \]

Now we define the coefficient of correlation R by the relation

\[ R = \frac{\text{co-variance } (x, y)}{\sigma_x \sigma_y}
     = \frac{\sigma_x^2 + \sigma_y^2 - \frac{1}{n} \sum d_i^2}{2 \sigma_x \sigma_y}. \]

Substituting the values of \( \sigma_x, \sigma_y \), we get

\[ R = 1 - \frac{6 \sum d_i^2}{n^3 - n}. \]

(I. A. S. ’56)

This is known as Spearman's Coefficient of Correlation.

11·8. Show that the coefficient of rank correlation lies between
-1 and 1.

The minimum value of the coefficient of rank correlation is
when the two rankings are exactly the reverse of each other, i.e. the
case in which the rankings are as follows :

    x_i :  1   2     3     ...   n-1   n,
    y_i :  n   n-1   n-2   ...   2     1.

In this case \( x_i + y_i = n + 1 \) for all i,
so that
\[ \sum (x_i + y_i)^2 = n (n + 1)^2 \qquad \dots(1) \]
and
\[ \sum (x_i^2 + y_i^2) = \frac{n (n + 1)(2n + 1)}{3}. \qquad \dots(2) \]

Subtracting (2) from (1), we get

\[ 2 \sum x_i y_i = n (n + 1)^2 - \frac{n (n + 1)(2n + 1)}{3} = \frac{n (n + 1)(n + 2)}{3}, \]

\[ \sum x_i y_i = \frac{n (n + 1)(n + 2)}{6}, \]

so that
\[ \text{cov } (x, y) = \frac{1}{n} \sum x_i y_i - \bar{x} \bar{y}
   = \frac{1}{n} \cdot \frac{n (n + 1)(n + 2)}{6} - \left\{ \frac{n + 1}{2} \right\}^2
   = -\frac{n^2 - 1}{12}. \]

Also
\[ \text{var } (x) = \frac{n^2 - 1}{12}, \quad \text{var } (y) = \frac{n^2 - 1}{12}, \]

\[ R = \frac{\text{cov } (x, y)}{\sigma_x \sigma_y} = -1. \]

The maximum value of R is when the two rankings exactly
coincide, so that \( (x_i - y_i) \) vanishes or \( d_i = 0 \) for all i.

Applying this value of \( d_i \), we get the maximum value of R to
be +1.

11*9. Solved Examples. 

1. The figures in the following table give the number of 
criminal convictions and the number unemployed (in millions) for 
the years 1924 — 33. Find the coefficient of rank correlation. 



[Table : for each of the years 1924 to 1933, the number convicted of crime and the number of unemployed (in millions) ; the entries are not legible in this copy.]

The ranking of data is as follows :

[Table : ranks of the two series for each year, with the rank differences \( d_i \) and their squares \( d_i^2 \) ; the entries are not legible in this copy.]

so that £</,- 2 =48, n=10. 
< 

Hence 


/?= 1 - 


6x8 

990 


bv three 


=•709. 

2. Ten competitors in a beauty contest are ranked 
judges in the following order / 

First judge 1 6 5 10 3 2 4 9 

Second judge 3 5 8 4 7 10 2 1 

Third judge 6 4 9 8 1 2 3 10 


Use the rank correlation to discuss which pair of judges have 
the nearest approach to common tastes in beauty. 

(Allahabad M. A. ’52) 

Rank correlation between the ranknigs of 1st and 2nd judges : 



vO 

5 

5 

OO 




1 ' » 


10, Zd* = 200. 
6 Zd- 

1 n (* 2 - 1 ) 

6 x 200 
990 

-* 2 . 


1 - 

















266 


STATISTICS 


Similarly the rank correlation between the first and third judges is

$R_{1,3} = 0.64$

and that between the second and third judges is

$R_{2,3} = -0.3$.

Thus we see that the first and third judges have the nearest
approach to common tastes in beauty.
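
The three coefficients can be recomputed with a few lines of Python. This is a hedged sketch rather than part of the original solution: the last two ranks given by each judge are illegible in the printed table and are taken here as (7, 8), (6, 9) and (5, 7), the only completion that reproduces Σd² = 200 and R = 0.64 quoted in the text. The names `judges` and `rank_correlation` are chosen for the illustration.

```python
from itertools import combinations

# Ranks awarded to the ten competitors; the last two entries of each
# list are reconstructed (assumed) as explained in the lead-in.
judges = {
    "first":  [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "second": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "third":  [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}


def rank_correlation(a, b):
    """R = 1 - 6*sum(d^2) / (n(n^2 - 1)) for two untied rankings."""
    n = len(a)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(a, b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


results = {
    (p, q): rank_correlation(judges[p], judges[q])
    for p, q in combinations(judges, 2)
}
for pair, value in results.items():
    print(pair, round(value, 2))

# The pair with the largest coefficient has the most similar tastes.
closest = max(results, key=results.get)
print("closest pair:", closest)
```

Only the first and third judges show a positive rank correlation; the other two pairs are negatively correlated.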

Note. If all the d's are zero, R = 1, showing that there is perfect
correlation of rank between the variables, which is the maximum
value of R.

3. Calculate the coefficient of correlation from the following 
data by the method of rank differences : 

x 81, 78, 73, 73, 69, 68, 62, 58, 
y 10, 12, 18, 18, 18, 22, 20, 24. 

Coefficient of rank correlation

   x_i    Rank of x_i    y_i    Rank of y_i    Rank difference d_i    d_i^2

   81        1           10        8                  7               49
   78        2           12        7                  5               25
   73        3.5         18        5                  1.5              2.25
   73        3.5         18        5                  1.5              2.25
   69        5           18        5                  0                0
   68        6           22        2                 -4               16
   62        7           20        3                 -4               16
   58        8           24        1                 -7               49

                                                     Total           159.50

where $d_i$ = (rank of $y_i$) - (rank of $x_i$).

We get $\sum_i d_i^2 = 159.50$.

Coefficient of rank correlation

$$= 1 - \frac{6\sum d_i^2}{n^3 - n} = 1 - \frac{159.50 \times 6}{8\,(8^2 - 1)} = 1 - \frac{957}{504} = -0.9 \text{ nearly}.$$

Note. In the above example, two items of x have the same value,
73. The ranks awarded to them are the arithmetic means of the
ranks that they would have got if they had differed by a small
quantity. Thus these items would have got the 3rd and 4th ranks
if they had a small difference, and hence the rank awarded to each
is $\frac{3+4}{2}$, i.e. 3.5, the next item having the rank 5. In the same
way the y items having the values 18, 18, 18 have each been awarded
the average of 4, 5, 6, i.e. 5. The next item would have the rank
7 in this case.
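
The averaging of tied ranks described in the note can be sketched in Python as follows. This is an illustration only; the helper `average_ranks` is a hypothetical name, and rank 1 is given to the largest value, as in the worked example.

```python
def average_ranks(values, descending=True):
    """Rank the values (1 = largest by default); tied items share the mean rank."""
    order = sorted(values, reverse=descending)
    ranks = []
    for v in values:
        first = order.index(v) + 1   # first rank position occupied by this value
        count = order.count(v)       # number of items tied at this value
        ranks.append(first + (count - 1) / 2)
    return ranks


x = [81, 78, 73, 73, 69, 68, 62, 58]
y = [10, 12, 18, 18, 18, 22, 20, 24]

rx, ry = average_ranks(x), average_ranks(y)
sum_d2 = sum((b - a) ** 2 for a, b in zip(rx, ry))
n = len(x)
R = 1 - 6 * sum_d2 / (n ** 3 - n)

print(rx)
print(ry)
print(sum_d2, round(R, 3))
```

The ranks and the value of Σd² agree with the table of the example, and R comes out close to -0.9.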

11.10. Regression: Linear Regression. If there exists some
relation between two variables, the points of their scatter diagram
cluster about some curve. If this curve is a straight
line, it suggests a linear relationship between the variables, and
this straight line is known as the line of regression. The term
regression was used by Sir Francis Galton, who found that though 'tall
fathers have tall sons', if the average height of a group of tall
fathers is x above the general height, the average height of their
sons is only a fraction of x above the general height. This recession in the average
height was described by Galton as 'regression to mediocrity'.
The example is not universally applicable, however, and the term
regression is now applied to other types of variables. Just as the term
imaginary numbers is still in use although it is now realized that
there is nothing imaginary in imaginary numbers, the
term 'regression' has been retained.

We have already seen how to fit a straight line to a dot diagram
by the method of least squares. If we minimize the sum of the
squares of the residuals of the ordinates, i.e. of the differences
between the ordinates of the points of the dot diagram and those of
the points of equal abscissa on the line to be fitted, the line so
obtained is known as the line of regression of
y over x; and if we minimize the sum of the squares of the residuals
of the abscissae between the points of equal ordinates on the line
and on the dot diagram,
the line thus fitted is known as the line of regression of x over y.
Except in particular cases, these two lines are distinct and hence
there are in general two regression lines. Let the equation of the line
of regression of y over x be $y = a + bx$.

The residual for the ith point is $y_i - a - bx_i$,
and minimising the sum of the squares as in the previous chapter,
we get the normal equations as

$$\sum y_i - na - b\sum x_i = 0,$$

$$\sum x_i y_i - a\sum x_i - b\sum x_i^2 = 0.$$

The first of these equations gives

$\bar{y} - a - b\bar{x} = 0$, ...(1)

which shows that the line passes through (X, y). (Agra M. Sc. ’f>2) 





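Transferring the origin to this point, the second
equation becomes

$$\sum_i (x_i - \bar{x})(y_i - \bar{y}) - a\sum_i (x_i - \bar{x}) - b\sum_i (x_i - \bar{x})^2 = 0.$$

But $\sum_i (x_i - \bar{x}) = 0$.

Hence $\sum_i (x_i - \bar{x})(y_i - \bar{y}) = b\sum_i (x_i - \bar{x})^2$

or

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\operatorname{cov}(x, y)}{\sigma_x^2} = r\,\frac{\sigma_y}{\sigma_x},$$

where r is the coefficient of correlation between x and y.

Hence the equation to the line of regression of y over x is

$$y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x}).$$

Similarly the best line of fit of the form

$x = a' + b'y$ is

$x - \bar{x} = r\,\dfrac{\sigma_x}{\sigma_y}\,(y - \bar{y})$, ...(2)

which is the equation of the line of regression of x over y.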

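If the two lines of regression coincide, the correlation between
the variables is perfect, the condition being

$$r\,\frac{\sigma_y}{\sigma_x} = \frac{\sigma_y}{r\,\sigma_x} \quad\text{or}\quad r^2 = 1$$

or $r = \pm 1$.

If the variables x and y are uncorrelated (as they are, in particular,
when they are independent), so that the coefficient
of correlation between them is zero, the line of regression of y
on x is $y = \bar{y}$ and that of x on y is $x = \bar{x}$. With the origin at
the mean, these lines coincide with the x-axis and the y-axis respectively, and
thus cut each other at right angles.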

The coefficient of regression of y on x is

$b_{yx} = \dfrac{\text{covariance}\,(x, y)}{\operatorname{var}(x)} = r\,\dfrac{\sigma_y}{\sigma_x}$, ...(4)

and similarly the coefficient of regression of x over y is

$b_{xy} = \dfrac{\text{covariance}\,(x, y)}{\operatorname{var}(y)} = r\,\dfrac{\sigma_x}{\sigma_y}$. ...(5)

It follows that

$$b_{yx} \cdot b_{xy} = r\,\frac{\sigma_y}{\sigma_x} \cdot r\,\frac{\sigma_x}{\sigma_y} = r^2.$$

Hence the geometric mean of the coefficients of regression is equal to r
in absolute value. Since both coefficients have the same sign as r, it is
clear that the arithmetic mean of the coefficients of regression is
numerically not less than r, since A. M. $\ge$ G. M. (Agra B. Sc. ’63)
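
The relations between the two regression coefficients and r can be verified numerically. The sketch below is an illustration, not the book's method of computation; it uses the marks data of the worked example on regression coefficients in § 11.12, and the function name `regression_summary` is an assumption.

```python
from math import sqrt


def regression_summary(x, y):
    """Return means, both regression coefficients and r for paired data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "b_yx": cov / var_x,              # slope of the line of y on x
        "b_xy": cov / var_y,              # slope of the line of x on y
        "r": cov / sqrt(var_x * var_y),
    }


paper_1 = [80, 45, 55, 56, 58, 60, 65, 68, 70, 75, 85]
paper_2 = [82, 56, 50, 48, 60, 62, 64, 65, 70, 74, 90]

s = regression_summary(paper_1, paper_2)
print({k: round(v, 3) for k, v in s.items()})

# Product of the coefficients equals r^2; their arithmetic mean is not less than r.
print(abs(s["b_yx"] * s["b_xy"] - s["r"] ** 2) < 1e-12)
print((s["b_yx"] + s["b_xy"]) / 2 >= s["r"])
```

Both printed checks are true for these data, in agreement with the statements above.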

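11.11. To show that the coefficient of correlation lies between
$\pm 1$. (Agra M. Sc. ’48, ’49, ’50)

Let $S_y^2 = \sum_i (y_i - a - bx_i)^2$

$= \sum_i \{(y_i - \bar{y}) - b(x_i - \bar{x})\}^2$, by equation (1) of § 11.10,

$= \sum_i (y_i - \bar{y})^2 - 2b\sum_i (x_i - \bar{x})(y_i - \bar{y}) + b^2\sum_i (x_i - \bar{x})^2$

$= N\sigma_y^2 - 2b\sum_i (x_i - \bar{x})(y_i - \bar{y}) + b^2 N\sigma_x^2$

$= N\sigma_y^2 - 2\,r\dfrac{\sigma_y}{\sigma_x}\cdot N r\sigma_x\sigma_y + N r^2\dfrac{\sigma_y^2}{\sigma_x^2}\,\sigma_x^2$

$= N\sigma_y^2\,(1 - r^2)$,

since $b = r\dfrac{\sigma_y}{\sigma_x}$ from equation (4) of § 11.10

and $\sum_i (x_i - \bar{x})(y_i - \bar{y}) = N r\sigma_x\sigma_y$.

$\frac{1}{N}S_y^2$ is the mean square deviation from the line
of regression of y on x, and its square root $\sigma_y\sqrt{1 - r^2}$ is
called the standard error of estimate.

Since $S_y^2$ is the sum of the squares of real quantities, it is
never negative and hence it follows that

$1 - r^2 \ge 0$

or $-1 \le r \le 1$.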
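
The identity for the sum of squared residuals can be confirmed on a small data set. The Python sketch below is illustrative only; the five pairs of values are hypothetical and are not taken from the text.

```python
from math import sqrt

# Hypothetical illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
var_x = sum((a - mean_x) ** 2 for a in x) / n
var_y = sum((b - mean_y) ** 2 for b in y) / n
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
r = cov / sqrt(var_x * var_y)

# Line of regression of y on x: y = a + b x.
b = cov / var_x
a = mean_y - b * mean_x

residual_ss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
predicted_ss = n * var_y * (1 - r ** 2)      # N * sigma_y^2 * (1 - r^2)
standard_error = sqrt(residual_ss / n)       # sigma_y * sqrt(1 - r^2)

print(round(residual_ss, 6), round(predicted_ss, 6), round(standard_error, 4))
```

The two sums of squares agree, and since the left-hand side cannot be negative, neither can 1 - r².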

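Alternative proof.

If $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$ are two sets of real numbers,
then

$$(a_1b_1 + a_2b_2 + \ldots + a_nb_n)^2 \le (a_1^2 + a_2^2 + \ldots + a_n^2)\,(b_1^2 + b_2^2 + \ldots + b_n^2),$$

the sign of equality occurring only when

$$\frac{a_1}{b_1} = \frac{a_2}{b_2} = \ldots = \frac{a_n}{b_n}.$$

(This is the inequality usually known as the Cauchy-Schwarz inequality.)

Now $r = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left\{\sum_i (x_i - \bar{x})^2 \,\sum_i (y_i - \bar{y})^2\right\}}}$.

Putting $x_i - \bar{x} = a_i$ and $y_i - \bar{y} = b_i$ in the above inequality,
we get $r^2 \le 1$ or $-1 \le r \le 1$.

The proof of the inequality is left to the student as an
exercise.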


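11.12. Solved Examples.

1. If $\beta$ be the angle between the lines of regression, show that

$$\tan\beta = \frac{(1 - r^2)\,\sigma_x\sigma_y}{r\,(\sigma_x^2 + \sigma_y^2)}.$$

Explain the significance of the formula when r = 0, r = 1.

(Agra B. Sc. ’61, M. Sc. ’55, ’59)

If $m_1 = r\dfrac{\sigma_y}{\sigma_x}$ and $m_2 = \dfrac{\sigma_y}{r\sigma_x}$ are the slopes of the lines and $\beta$ is the acute
angle between them,

$$\tan\beta = \frac{m_2 - m_1}{1 + m_1m_2} = \frac{\dfrac{\sigma_y}{\sigma_x}\left(\dfrac{1}{r} - r\right)}{1 + \dfrac{\sigma_y^2}{\sigma_x^2}} = \frac{(1 - r^2)\,\sigma_x\sigma_y}{r\,(\sigma_x^2 + \sigma_y^2)}.$$

If r = 0, $\tan\beta = \infty$ and $\beta = \frac{\pi}{2}$, i.e. when the variables are
uncorrelated, the lines are perpendicular to each other, while if r = 1, the
lines coincide since $\beta = 0$ in that case.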
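
The formula for the angle can be compared with the angle obtained directly from the two slopes. The following Python sketch is illustrative; the values r = 0.6, σ_x = 3, σ_y = 4 are sample inputs, and the absolute value of r is used so that the acute angle is obtained for negative r as well (an assumption beyond the statement of the example).

```python
from math import atan, degrees, inf


def tan_angle_between_regression_lines(r, sd_x, sd_y):
    """tan(beta) = (1 - r^2) sd_x sd_y / (|r| (sd_x^2 + sd_y^2))."""
    if r == 0:
        return inf                       # the lines are perpendicular
    return (1 - r * r) * sd_x * sd_y / (abs(r) * (sd_x ** 2 + sd_y ** 2))


def tan_angle_from_slopes(r, sd_x, sd_y):
    """Same quantity computed from the slopes m1 (y on x) and m2 (x on y)."""
    m1 = r * sd_y / sd_x
    m2 = sd_y / (r * sd_x)
    return abs((m2 - m1) / (1 + m1 * m2))


t_formula = tan_angle_between_regression_lines(0.6, 3, 4)
t_slopes = tan_angle_from_slopes(0.6, 3, 4)
print(round(t_formula, 4), round(t_slopes, 4), round(degrees(atan(t_formula)), 1))

# Limiting cases discussed in the example.
print(degrees(atan(tan_angle_between_regression_lines(0, 3, 4))))   # r = 0
print(tan_angle_between_regression_lines(1, 3, 4))                  # r = 1
```

For r = 0 the angle is a right angle, and for r = 1 the tangent vanishes, so the lines coincide.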

2. The following table enumerates the marks obtained by a 
class of students in Statistics I and II papers. 


Marks in first paper x : 80 45 55 56 58 60 65 68 70 75 85
Marks in second paper y : 82 56 50 48 60 62 64 65 70 74 90
Calculate the coefficients of regression. 

(Agra B. Sc. ’56, I. A. A. ’45) 


              Paper I                              Paper II

   Marks   Deviation from   (Deviation)^2   Marks   Deviation from   (Deviation)^2   Product of
     x     ass. mean 65                       y     ass. mean 70                     deviations
               ξ                ξ^2                     η                η^2             ξη

    80         15               225          82         12               144            180
    45        -20               400          56        -14               196            280
    55        -10               100          50        -20               400            200
    56         -9                81          48        -22               484            198
    58         -7                49          60        -10               100             70
    60         -5                25          62         -8                64             40
    65          0                 0          64         -6                36              0
    68          3                 9          65         -5                25            -15
    70          5                25          70          0                 0              0
    75         10               100          74          4                16             40
    85         20               400          90         20               400            400

   Total        2              1414                    -49              1865           1393

Number of observations N = 11.

Now

$$r = \frac{N\sum\xi\eta - \sum\xi\sum\eta}{\sqrt{\{N\sum\xi^2 - (\sum\xi)^2\}\,\{N\sum\eta^2 - (\sum\eta)^2\}}}
= \frac{11 \times 1393 - 2\,(-49)}{\sqrt{(11 \times 1414 - 2^2)\,(11 \times 1865 - 49^2)}} = 0.918,$$

$$\sigma_x = \sqrt{\left\{\frac{1}{N}\sum\xi^2 - \left(\frac{1}{N}\sum\xi\right)^2\right\}}
= \sqrt{\left\{\tfrac{1}{11} \times 1414 - \left(\tfrac{2}{11}\right)^2\right\}} = \sqrt{128.51} = 11.33.$$

Similarly

$$\sigma_y = \sqrt{\left\{\frac{1}{N}\sum\eta^2 - \left(\frac{1}{N}\sum\eta\right)^2\right\}}
= \sqrt{\left\{\tfrac{1}{11} \times 1865 - \left(-\tfrac{49}{11}\right)^2\right\}} = 12.23.$$

Coefficient of regression of y over x is

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = 0.918 \times \frac{12.23}{11.33} = 0.99 \text{ nearly}.$$

Similarly the coefficient of regression of x on y is

$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = 0.918 \times \frac{11.33}{12.23} = 0.85 \text{ nearly}.$$

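3. The equations of two regression lines obtained in a correlation
analysis of 60 observations are 5x = 6y + 24 and 1000y = 768x
- 3608. What is the correlation coefficient and what is its probable
error?

Show that the ratio of the coefficient of variability of x to that
of y is $\frac{5}{24}$. What is the ratio of the variances of x and y?

(Madras ’42)

Taking the first equation as the line of regression of x on y and the
second as that of y on x, we have

$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{6}{5}, \qquad b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = \frac{768}{1000}.$$

$$\therefore\quad b_{yx} \cdot b_{xy} = r^2 = \frac{6}{5} \times \frac{768}{1000} = 0.9216.$$

Hence $r = \pm 0.96$.

Since $b_{xy}$ and $b_{yx}$ are positive, the correlation is positive and
hence r = 0.96.

$$\text{Probable error of } r = \frac{2}{3}\cdot\frac{1 - r^2}{\sqrt{n}} = 0.007 \text{ nearly}.$$

Solving the given equations, $\bar{x} = 6$, $\bar{y} = 1$, since the regression
lines pass through $(\bar{x}, \bar{y})$.

Since $r\dfrac{\sigma_x}{\sigma_y} = \dfrac{6}{5}$ and r = 0.96, we get

$$\frac{\sigma_x}{\sigma_y} = \frac{6}{5 \times 0.96} = \frac{5}{4},$$

so that the ratio of the variances is $\sigma_x^2 : \sigma_y^2 = 25 : 16$.

Also the ratio of the coefficients of variability

$$= \frac{\sigma_x/\bar{x}}{\sigma_y/\bar{y}} = \frac{\sigma_x}{\sigma_y}\cdot\frac{\bar{y}}{\bar{x}} = \frac{5}{4} \times \frac{1}{6} = \frac{5}{24}.$$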
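
The arithmetic of this example can be retraced in Python. The sketch below is an illustration under the same assumption as the solution, namely that 5x = 6y + 24 is the line of x on y and 1000y = 768x - 3608 the line of y on x; the factor 0.6745 is the usual constant in the probable error, which the text rounds to 2/3. Variable names are chosen for the illustration.

```python
from math import sqrt

# x = b_xy * y + c_x   and   y = b_yx * x + c_y
b_xy, c_x = 6 / 5, 24 / 5
b_yx, c_y = 768 / 1000, -3608 / 1000
n = 60

r = sqrt(b_xy * b_yx)              # positive root, since both slopes are positive
probable_error = 0.6745 * (1 - r ** 2) / sqrt(n)

# The lines intersect at the means.
mean_y = (b_yx * c_x + c_y) / (1 - b_xy * b_yx)
mean_x = b_xy * mean_y + c_x

sd_ratio = b_xy / r                # sigma_x / sigma_y
cv_ratio = sd_ratio * mean_y / mean_x
variance_ratio = sd_ratio ** 2

print(round(r, 2), round(probable_error, 4))
print(round(mean_x, 6), round(mean_y, 6))
print(round(sd_ratio, 4), round(cv_ratio, 4), round(variance_ratio, 4))
```

The ratio of the coefficients of variability comes out as 0.2083, i.e. 5/24, and the ratio of the variances as 1.5625, i.e. 25/16.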


Note. Predictor and Predictand. The regression lines are
used to find the value of a dependent variable, generally y, for a
known value of x, the independent variable. In such a case
only one line of regression, that of y on x, is required. The
independent variable is called the predictor and the dependent
variable the predictand.



4. A study of wheat prices at Hapur and Kanpur yields the
following data:

                                 Hapur          Kanpur

Average price per maund        Rs. 2.463      Rs. 2.797
Standard deviation               0.326          0.207

r between prices at
Hapur and Kanpur               +0.774

Estimate from the above data the most likely price of wheat
(a) at Hapur corresponding to the price of Rs. 2.334 per maund at
Kanpur; (b) at Kanpur corresponding to the price of Rs. 3.052 per
maund at Hapur. (P. C. S. U. P. 1938)

(a) In the first case we take the price at Kanpur to be the
independent variable (x) and that at Hapur the dependent variable (y).
The equation to the line of regression of y over x is

$$y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$

or

$$y - 2.463 = 0.774 \times \frac{0.326}{0.207}\,(x - 2.797).$$

Now when x = 2.334, we have

$$y - 2.463 = 0.774 \times \frac{0.326}{0.207}\,(2.334 - 2.797)$$

or

$$y = 2.463 - 1.219 \times 0.463 = 1.899.$$

Hence when the price at Kanpur is Rs. 2.334, that at Hapur
is Rs. 1.899.

(b) To calculate the price at Kanpur for a given value at Hapur,
we write the regression equation of x over y,

$$x - \bar{x} = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y}),$$

i.e.

$$x - 2.797 = 0.774 \times \frac{0.207}{0.326}\,(y - 2.463).$$

When y = 3.052, we have

$$x = 2.797 + 0.774 \times \frac{0.207}{0.326}\,(3.052 - 2.463) = 3.086.$$

Hence when the price at Hapur is Rs. 3.052, that at Kanpur
is Rs. 3.086.
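
Both estimates can be reproduced with a small helper. This Python sketch is illustrative only; the function name `regression_estimate` is an assumption, and the mean price at Kanpur is taken as Rs. 2.797, the value used throughout the working of the solution.

```python
def regression_estimate(value, mean_given, sd_given, mean_target, sd_target, r):
    """Estimate the target variable from the line of regression of target on given."""
    return mean_target + r * (sd_target / sd_given) * (value - mean_given)


r = 0.774
hapur_mean, hapur_sd = 2.463, 0.326
kanpur_mean, kanpur_sd = 2.797, 0.207   # mean as used in the solution

# (a) price at Hapur when the price at Kanpur is Rs. 2.334
hapur_price = regression_estimate(2.334, kanpur_mean, kanpur_sd,
                                  hapur_mean, hapur_sd, r)

# (b) price at Kanpur when the price at Hapur is Rs. 3.052
kanpur_price = regression_estimate(3.052, hapur_mean, hapur_sd,
                                   kanpur_mean, kanpur_sd, r)

print(round(hapur_price, 3), round(kanpur_price, 3))
```

Note that a different regression line is used for each estimate, in keeping with the remark that there are in general two lines of regression.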

5. For two variables x and y with the same mean, the two
regression equations are $y = ax + b$ and $x = \alpha y + \beta$. Show that

$$\frac{b}{\beta} = \frac{1 - a}{1 - \alpha}.$$

Find also the common mean.

(Madras ’56)

Let the common mean be m. Then the two lines of regression are

$y - m = a\,(x - m)$,
$x - m = \alpha\,(y - m)$.

Comparing with the given equations, we get

$b = m\,(1 - a)$,

$\beta = m\,(1 - \alpha)$.

Dividing, we get

$$\frac{b}{\beta} = \frac{1 - a}{1 - \alpha}.$$

Evidently,

$$m = \frac{b}{1 - a} = \frac{\beta}{1 - \alpha}.$$
6. A computer, while calculating the correlation coefficient
between two variables x and y from 25 pairs of observations, obtained
the following results:

n = 25, $\sum x = 125$, $\sum x^2 = 650$, $\sum y = 100$, $\sum y^2 = 460$, $\sum xy = 508$.

It was, however, later discovered at the time of checking that he had
copied down two pairs as

    x     y
    6    14
    8     6

while the correct values were

    x     y
    8    12
    6     8

Obtain the correct value of the correlation coefficient.

The correct data would be

n = 25, $\sum x = 125$, $\sum x^2 = 650$, $\sum y = 100$, $\sum y^2 = 460 - 232 + 208 = 436$,
$\sum xy = 508 - 132 + 144 = 520$.

[As may be seen, $\sum x$ and $\sum x^2$ remain the same in both cases;
$\sum y$ is also unchanged, but the sum of the squares of the y's in the first
table is 232 and in the second table 208; similarly the sum of the products xy
in the first table is (14 × 6 + 8 × 6) = 132, while in the second table
it is 144.]

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\{n\sum x^2 - (\sum x)^2\}\,\{n\sum y^2 - (\sum y)^2\}}}
= \frac{25 \times 520 - 125 \times 100}{\sqrt{\{25 \times 650 - (125)^2\}\,\{25 \times 436 - (100)^2\}}}$$

$$= \frac{20}{\sqrt{(25 \times 36)}} = \frac{2}{3}.$$
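
The correction of the sums can be carried out mechanically: subtract the contribution of each wrongly copied pair and add that of the correct pair. The Python sketch below illustrates this; the dictionary layout and variable names are assumptions made for the example.

```python
from math import sqrt

# Sums as first obtained (with the two wrongly copied pairs).
totals = {"n": 25, "x": 125, "xx": 650, "y": 100, "yy": 460, "xy": 508}

wrong_pairs = [(6, 14), (8, 6)]
right_pairs = [(8, 12), (6, 8)]

for sign, pairs in ((-1, wrong_pairs), (+1, right_pairs)):
    for x, y in pairs:
        totals["x"] += sign * x
        totals["xx"] += sign * x * x
        totals["y"] += sign * y
        totals["yy"] += sign * y * y
        totals["xy"] += sign * x * y

n = totals["n"]
numerator = n * totals["xy"] - totals["x"] * totals["y"]
denominator = sqrt((n * totals["xx"] - totals["x"] ** 2)
                   * (n * totals["yy"] - totals["y"] ** 2))
r = numerator / denominator

print(totals)
print(round(r, 4))
```

The corrected sums agree with those of the solution and the coefficient comes out as 2/3.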

7. Prove that $\sigma_{x-y}^2 = \sigma_x^2 + \sigma_y^2 - 2\,r_{xy}\,\sigma_x\sigma_y$, where $r_{xy}$ is the
correlation coefficient between x and y.

Hence or otherwise find r from the following data:

x   21  23  30  54   57  58   72  78   87   90

y   60  71  72  83  110  84  100  92  113  135.

(Madras ’44)

For the first part see equation (2) of Ex. 9, page 261.

    x     x - 54     ξ^2      y     y - 100     η^2     y - x    (y - x)^2
           = ξ                        = η

   21      -33      1089     60      -40       1600      39       1521
   23      -31       961     71      -29        841      48       2304
   30      -24       576     72      -28        784      42       1764
   54        0         0     83      -17        289      29        841
   57        3         9    110       10        100      53       2809
   58        4        16     84      -16        256      26        676
   72       18       324    100        0          0      28        784
   78       24       576     92       -8         64      14        196
   87       33      1089    113       13        169      26        676
   90       36      1296    135       35       1225      45       2025

  Total     30      5936             -80       5328     350      13596

$$\sigma_x^2 = \tfrac{1}{10}\sum\xi^2 - \left(\tfrac{1}{10}\sum\xi\right)^2 = 593.6 - 3^2 = 584.6,$$

$$\sigma_y^2 = \tfrac{1}{10}\sum\eta^2 - \left(\tfrac{1}{10}\sum\eta\right)^2 = 532.8 - 8^2 = 468.8,$$

$$\sigma_{x-y}^2 = \tfrac{1}{10}\sum (y - x)^2 - \left\{\tfrac{1}{10}\sum (y - x)\right\}^2 = 1359.6 - (35)^2 = 1359.6 - 1225 = 134.6.$$

From the above formula,

$$r_{xy} = \frac{\sigma_x^2 + \sigma_y^2 - \sigma_{x-y}^2}{2\,\sigma_x\sigma_y}
= \frac{584.6 + 468.8 - 134.6}{2 \times 24.18 \times 21.65}
= \frac{918.8}{2 \times 24.18 \times 21.65},$$

giving $r_{xy} = 0.878$.
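
The computation of r from the variance of the difference can be checked directly from the ten pairs of values. The Python sketch below is an illustration only; direct computation gives σ_y² = 532.8 - 64 = 468.8 and r close to 0.878. The helper name `variance` is chosen for the example.

```python
from math import sqrt

x = [21, 23, 30, 54, 57, 58, 72, 78, 87, 90]
y = [60, 71, 72, 83, 110, 84, 100, 92, 113, 135]


def variance(values):
    """Population variance (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


var_x = variance(x)
var_y = variance(y)
var_diff = variance([b - a for a, b in zip(x, y)])   # variance of y - x

# sigma_{x-y}^2 = sigma_x^2 + sigma_y^2 - 2 r sigma_x sigma_y, solved for r
r = (var_x + var_y - var_diff) / (2 * sqrt(var_x * var_y))

print(round(var_x, 1), round(var_y, 1), round(var_diff, 1), round(r, 3))
```

The same value of r is obtained from the usual product-moment formula, which provides a check on the identity.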

8. The lines of regression for a set of data are given as under:

2y - x - 50 = 0, 3y - 2x - 10 = 0.

Show that the regression estimate of y for x = 150 is 100,
whereas the regression estimate of x corresponding to y = 100 is 145.
Explain the difference.

The regression lines are

$y = \frac{1}{2}x + 25$,

$x = \frac{3}{2}y - 5$.

Putting x = 150 in the first equation gives y = 100, and putting
y = 100 in the second equation gives x = 145.

The discrepancy is due to the fact that the lines are not coincident.
In the first we minimize the sum of the squares of the
residuals of y, while in the second the sum of the squares of the
residuals of x is minimized.

Note. The choice of the line of regression of y on x and
that of x on y should be made in such a manner that $b_{xy} \cdot b_{yx} \le 1$.

9. For a given bivariate distribution, find the straight line for
which the sum of the squares of the normal deviations is minimum.

Let the equation of the required straight line be

$x\cos\alpha + y\sin\alpha - p = 0$. ...(1)

The normal deviation of any observed point $(x_i, y_i)$ is the
length of the perpendicular from the point $(x_i, y_i)$ upon the
straight line $x\cos\alpha + y\sin\alpha - p = 0$, i.e. $x_i\cos\alpha + y_i\sin\alpha - p$.

We have to minimize $\sum_i f_i\,(x_i\cos\alpha + y_i\sin\alpha - p)^2$, where $f_i$
represents the frequency of the group with $x_i$ and $y_i$ as the
mid-values. Let

$$u = \sum_i f_i\,(x_i\cos\alpha + y_i\sin\alpha - p)^2.$$

Then

$\dfrac{\partial u}{\partial\alpha} = 2\sum_i f_i\,(x_i\cos\alpha + y_i\sin\alpha - p)\,(y_i\cos\alpha - x_i\sin\alpha)$, ...(2)

$\dfrac{\partial u}{\partial p} = -2\sum_i f_i\,(x_i\cos\alpha + y_i\sin\alpha - p)$. ...(3)

For extremum values of u, we equate $\dfrac{\partial u}{\partial\alpha}$ and $\dfrac{\partial u}{\partial p}$ each to zero.
From (3), we get

$$\sum_i f_i\,(x_i\cos\alpha + y_i\sin\alpha - p) = 0$$

or

$$\cos\alpha\sum_i f_ix_i + \sin\alpha\sum_i f_iy_i - p\sum_i f_i = 0$$

or

$\bar{x}\cos\alpha + \bar{y}\sin\alpha - p = 0$, ...(4)

since $\dfrac{1}{N}\sum f_ix_i = \bar{x}$, $\dfrac{1}{N}\sum f_iy_i = \bar{y}$, where $N = \sum f_i$.

Also on equating (2) to zero, we get a quadratic equation
which shows that there are two straight lines for extremum values
of u. From equation (4), we get that both these lines pass through
$(\bar{x}, \bar{y})$, the mean of the distribution. From equation (2), we have

$\sum_i f_i\,(x_i\cos\alpha + y_i\sin\alpha - p)\,(y_i\cos\alpha - x_i\sin\alpha) = 0$. ...(5)

On transferring the origin to $(\bar{x}, \bar{y})$ and applying (4), we get

$$\sum_i f_i\,[\cos\alpha\,(x_i - \bar{x}) + \sin\alpha\,(y_i - \bar{y})]\,[y_i\cos\alpha - x_i\sin\alpha] = 0.$$

Putting $x_i - \bar{x} = x_i'$, $y_i - \bar{y} = y_i'$, we get

$$\sum_i f_i\,(x_i'\cos\alpha + y_i'\sin\alpha)\,[(y_i' + \bar{y})\cos\alpha - (x_i' + \bar{x})\sin\alpha] = 0$$

or

$$(\bar{y}\cos\alpha - \bar{x}\sin\alpha)\sum_i f_i\,(x_i'\cos\alpha + y_i'\sin\alpha)
+ (\cos^2\alpha - \sin^2\alpha)\sum_i f_ix_i'y_i' + \sin\alpha\cos\alpha\sum_i f_i\,(y_i'^2 - x_i'^2) = 0.$$

The first term vanishes because $\sum f_ix_i' = \sum f_iy_i' = 0$, giving

$r\,\sigma_x\sigma_y\cos 2\alpha + \tfrac{1}{2}\sin 2\alpha\,(\sigma_y^2 - \sigma_x^2) = 0$, ...(6)

since

$$\frac{1}{N}\sum f_ix_i'y_i' = \operatorname{cov}(x, y) = r\,\sigma_x\sigma_y$$

and

$$\frac{1}{N}\sum f_iy_i'^2 = \sigma_y^2, \qquad \frac{1}{N}\sum f_ix_i'^2 = \sigma_x^2.$$

(It should be noted that $x_i'$ and $y_i'$ are the deviations
from the means $\bar{x}$ and $\bar{y}$ respectively in this case.)

From equation (6), we get

$$\tan 2\alpha = \frac{2\,r\,\sigma_x\sigma_y}{\sigma_x^2 - \sigma_y^2}.$$

The above equation gives two values of $\alpha$. If one is $\theta$, the
other is $\frac{\pi}{2} + \theta$. Hence there are two straight lines at right angles
to each other. Clearly one of them gives the minimum value of u and the
other the maximum, as maxima and minima occur
alternately.
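
The choice between the two roots can be illustrated numerically. The following Python sketch uses a small hypothetical data set with unit frequencies (the data and the function name `mean_square_normal_deviation` are assumptions, not taken from the text). It computes the two values of α from tan 2α = 2rσ_xσ_y/(σ_x² - σ_y²) and keeps the one giving the smaller mean square normal deviation.

```python
from math import atan2, cos, sin, pi

# Hypothetical data, each point with frequency 1.
x = [1, 2, 3, 4, 5, 6]
y = [1, 3, 2, 5, 4, 7]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
var_x = sum((a - mean_x) ** 2 for a in x) / n
var_y = sum((b - mean_y) ** 2 for b in y) / n
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n   # r*sd_x*sd_y


def mean_square_normal_deviation(alpha):
    """u/N for the line x cos(alpha) + y sin(alpha) = p through the mean."""
    p = mean_x * cos(alpha) + mean_y * sin(alpha)
    return sum((a * cos(alpha) + b * sin(alpha) - p) ** 2
               for a, b in zip(x, y)) / n


# Two roots of tan(2 alpha) = 2 cov / (var_x - var_y), differing by pi/2.
alpha_1 = 0.5 * atan2(2 * cov, var_x - var_y)
alpha_2 = alpha_1 + pi / 2

alpha_min = min((alpha_1, alpha_2), key=mean_square_normal_deviation)
alpha_max = max((alpha_1, alpha_2), key=mean_square_normal_deviation)
p_min = mean_x * cos(alpha_min) + mean_y * sin(alpha_min)

print(round(alpha_min, 4), round(p_min, 4))
print(round(mean_square_normal_deviation(alpha_min), 4),
      round(mean_square_normal_deviation(alpha_max), 4))
```

One root gives the line of closest fit in the sense of normal deviations and the other, at right angles to it, the line of worst fit through the mean.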


10. From a partially destroyed laboratory record of an analysis of
correlation data, the following results only are legible:

Variance of x = 9.

Regression equations:

8x - 10y + 66 = 0,

40x - 18y = 214.



278 


STATISTICS 


(a) What are the mean values of x and y, (b) the S. D. of y, 
and (c) the coefficient of correlation between x and y ? 

(Agra M. Sc. ’63, I. A. S. ’47, Agra M. Com. *62, 

U. P. P. C. S. ’56) 

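Since the regression lines pass through $(\bar{x}, \bar{y})$, on solving
the given equations we get $\bar{x} = 13$, $\bar{y} = 17$.

If we assume the first equation to be the line of regression of y
on x and the second that of x on y, we have

10y = 8x + 66

or y = 0.8x + 6.6,

giving $b_{yx} = r\dfrac{\sigma_y}{\sigma_x} = 0.8$.

The second equation gives

40x = 18y + 214

or x = 0.45y + 5.35,

giving $b_{xy} = r\dfrac{\sigma_x}{\sigma_y} = 0.45$.

Hence $r^2 = b_{yx} \cdot b_{xy} = 0.8 \times 0.45 = 0.36$, so that
r = 0.6, the positive sign being taken since $b_{xy}$ and $b_{yx}$ are
both positive.

Also $\sigma_x = 3$.

Substituting the values of r and $\sigma_x$ in the value of $b_{xy}$, we get

$$r\,\frac{\sigma_x}{\sigma_y} = 0.6 \times \frac{3}{\sigma_y} = 0.45$$

or $\sigma_y = 4$.

Thus $\bar{x} = 13$, $\bar{y} = 17$, r = 0.6, $\sigma_y = 4$.

Note. If we take the first equation to be the line of regression
of x on y and the second that of y on x, the value of $r^2$ comes out to
be $\frac{10}{8} \times \frac{40}{18}$, i.e. > 1, which is not admissible.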

11. Define the correlation coefficient $r_{xy}$ between the variables
x, y. Show that $r_{xy}$ will be positive or negative according as

$\sigma_{x+y} >$ or $< \sigma_{x-y}$.

Given that x = 4y + 5, y = kx + 4 are the regression lines of x on y
and y on x respectively, show that $0 \le k \le \frac{1}{4}$. If $k = \frac{1}{8}$, find the
means of the two variables and the coefficient of correlation between
them. (U. P. P. C. S. ’59)




From equation (2) of Ex. 9, p. 261, we get

$\sigma_{x+y}^2 = \sigma_x^2 + \sigma_y^2 + 2\,\sigma_x\sigma_y\,r_{xy}$,
$\sigma_{x-y}^2 = \sigma_x^2 + \sigma_y^2 - 2\,\sigma_x\sigma_y\,r_{xy}$.

Subtracting, we get

$\sigma_{x+y}^2 - \sigma_{x-y}^2 = 4\,\sigma_x\sigma_y\,r_{xy}$.

Since $\sigma_x$ and $\sigma_y$ are positive quantities, the right hand side is
positive or negative according as $r_{xy}$ is positive or negative.

Hence $\sigma_{x+y} >$ or $< \sigma_{x-y}$

according as $r_{xy}$ is +ve or -ve.

We see that $b_{xy} = 4$ and $b_{yx} = k$.

Since $b_{xy} \cdot b_{yx} = r^2 \le 1$,

we get $4k \le 1$ or $k \le \frac{1}{4}$. Also $k \ge 0$, since $b_{yx}$ must have the
same sign as $b_{xy}$.

If $k = \frac{1}{8}$, we get the means $\bar{x}$ and $\bar{y}$ by solving the two
equations. Since the lines of regression pass through the point
$(\bar{x}, \bar{y})$, we get

$\bar{x} = 42$,

$\bar{y} = 9.25$.

The coefficient of correlation

$$= \sqrt{(b_{yx} \times b_{xy})} = \sqrt{\tfrac{1}{2}} = 0.707.$$

12. A student has obtained the following answers to certain
problems. Discuss and criticize them.

(i) Mean = 3, variance = 5 for a binomial distribution.

(ii) Mean = 4, variance = 3 for a Poisson distribution.

(iii) Coefficient of regression of y on x = 3.2 and coefficient of
regression of x on y = 0.8 for a bivariate distribution.

(U. P. P. C. S. ’59)

(i) For a binomial distribution, the mean is np and the variance
npq, so that np = 3,

npq = 5,

giving $q = \frac{5}{3} > 1$,

which is impossible since p + q = 1.

Hence the data are wrong.

(ii) In a Poisson distribution the variance is equal to the mean,
but in this case they are different. Hence the data are
wrong.

(iii) Here $b_{yx} = 3.2$

and $b_{xy} = 0.8$.

Hence $b_{yx} \times b_{xy} = 2.56 > 1$.

But $b_{xy} \times b_{yx} = r^2$,

which is never greater than unity. Hence the contradiction.

Exercises 

1. Calculate the coefficient of correlation and obtain the lines of
regression for the following data:

x:  1  2   3   4   5   6   7   8   9

y:  9  8  10  12  11  13  14  16  15.

Obtain an estimate of y which should correspond on the
average to x = 6.2. (I. A. S. 1955)

[Ans. r = 0.95; the lines of regression are
y - 12 = 0.95 (x - 5) and x - 5 = 0.95 (y - 12).

Required estimate is 13.14.]




2. Calculate r from the following table and indicate its probable
error:

                  Net area sown in      No. of ploughs
                  lakhs of acres        in lakhs

   U. P.               359                  52
   Madras              310                  44
   Bombay              285                  12
   Punjab              275                  24
   B. & O.             257                  35
   C. P.               245                  16
   Bengal              240                  46
   Assam                64                  11
   Sind                 48                   3
   N. W. F. P.          23                   2

   Average             211                  25

(P. C. S. 1943)
[Ans. r=0.0835]

3. The following marks have been obtained by a class of
students in statistics (out of 100).

Paper I    80, 45, 55, 56, 58, 60, 65, 68, 70, 75, 85

Paper II   82, 56, 50, 48, 60, 62, 64, 65, 70, 74, 90

Compute the coefficient of correlation for the above data.
Find the lines of regression and examine the relationship.

[Ans. r = 0.918; regression lines are
x = 0.85y + 9.52, y = 0.99x + 1]




4. The correlation table given below shows the ages of husband
and wife for fifty-three married couples living together on the
census night of 1941. Calculate the coefficient of correlation
between the ages of husband and wife.

   Age of                        Age of wives
   husbands     15-25   25-35   35-45   45-55   55-65   65-75    Total

   15-25          1       1      ..      ..      ..      ..        2
   25-35          2      12       1      ..      ..      ..       15
   35-45         ..       4      10       1      ..      ..       15
   45-55         ..      ..       3       6       1      ..       10
   55-65         ..      ..      ..       2       4       2        8
   65-75         ..      ..      ..      ..       1       2        3

   Total          3      17      14       9       6       4       53

(I. A. S. 1950)
[Ans. 0.91]

5. The following table gives the marks obtained in English 
and Mathematics by 10 students. 

Marks in English 44 42 40 52 39 32 24 46 41 50 

Marks in Maths. 24 25 28 29 32 35 36 41 45 50 

Calculate (1) Karl Pearson's coefficient of correlation.

(2) Spearman's rank correlation coefficient. (B.U. 1951)

[Ans. (1) 0.0808, (2) 0.0182]

6 . Find the most likely price in Bombay corresponding to the 
price of Rs. 70 at Calcutta from the following data :— 

Average price : at Calcutta 65; at Bombay 67. 

Standard deviation : at Calcutta 2*5; at Bombay 3*5. 
Coefficient of correlation is +0*8 between the two prices 
of the commodity in the two towns. 

[Ans. Rs. 72.6]

7. The ranks of the same 16 students in two subjects A and B
were as follows. The two numbers within brackets denote the
ranks of a student in A and B respectively.

(1, 1), (2, 10), (3, 3), (4, 4), (5, 5), (6, 7), (7, 2), (8, 6),
(9, 8), (10, 11), (11, 15), (12, 9), (13, 14), (14, 12), (15, 16),
(16, 13).

Calculate the rank correlation for proficiencies of this
group in subjects A and B.

(Agra M. Sc. ’52, Punjab, B. A. ’52)

[Ans. 0.8]

8. The independent random variables x and y are defined by

$f(x) = 4ax$, $0 \le x \le r$,

$= 0$ elsewhere;

$f(y) = 4by$, $0 \le y \le s$,

$= 0$ elsewhere.

Find the coefficient of correlation between u and v in terms
of a and b, where u = x + y and v = x - y.

[Ans. $\rho_{uv} = \dfrac{b - a}{b + a}$]

9. Explain why there are two lines of regression. Write down 
two regression equations that may be associated with the 
following pairs of values : 

x: 152 114 138 154 144 153 141 117 136 154 

y : 193 300 414 594 676 549 320 483 481 659 

(I. A. S. ’51) 

[Ans. x = 0.03y + 126.692, y = 3.575x - 34.67]

10. Two lines of regression are given by x + 2y - 5 = 0 and
2x + 3y - 8 = 0, and the variance of x = 12.

Calculate the values of $\bar{x}$, $\bar{y}$, $\sigma_y^2$ and r.

(M.A. Allahabad ’52)
[Ans. $\bar{x} = 1$, $\bar{y} = 2$, $\sigma_y^2 = 4$, r = -0.866]

11. Is the following statement correct? Give reasons. The
regression coefficient of x on y is 3.2 and that of y on x is
0.8. (B.A. Hons. Delhi ’52)

[Ans. No, since r = 1.6 > 1, which is impossible]

12. In the following table are recorded data showing the test
scores made by salesmen on an intelligence test and their
weekly sales:

Salesmen         1    2    3    4    5    6    7    8    9    10

Test scores     40   70   50   60   80   50   90   40   60   60
Sales ('000)   2.5  6.0  4.5  5.0  4.5  2.0  5.5  3.0  4.5  3.0

Calculate the regression line of sales on test score and
estimate the most probable weekly sales volume if a salesman
makes a score of 70. (I. A. S. 1948)

[Ans. y = 0.06x + 0.45.

Most probable weekly sales volume is 4.65.]


13. Calculate the coefficient of correlation between the ages of
husbands and wives in the following:

   Age of                   Age of wives in years
   husbands
   in years     10-20   20-30   30-40   40-50   50-60    Total

   15-25          6       3      ..      ..      ..         9
   25-35          3      16      10      ..      ..        29
   35-45         ..      10      15       7      ..        32
   45-55         ..      ..       7      10       4        21
   55-65         ..      ..      ..       4       5         9

   Total          9      29      32      21       9       100

(Agra M.A. 1953)





CHAPTER XII 

MULTIPLE AND PARTIAL CORRELATIONS 

12.1. Multiple Correlation. The term multiple correlation
refers to the theory of correlation involving more than two
variables. We have seen in the previous chapter that simple
correlation deals with the degree of relationship between two
variables, such as ages of husbands and wives, supply and demand
of a commodity, height and weight, and so on. Multiple correlation
is used to find the degree of inter-relationship among three
or more variables. For example, the yield of a crop in a year
may depend upon rainfall, manure, the average temperature and
average humidity during the period between sowing and harvesting
of the crop; the rents of houses may depend upon tax rates as well
as upon building costs, and upon other variables also; general
intelligence in schools may be related to grades in mathematics
and grades in English; and so on. Thus the aim of the theory of
multiple correlation is to know how far the dependent variable
is influenced by the independent variables. We shall denote the
multiple correlation between $x_1$, the dependent variable, and $x_2$,
$x_3, \ldots, x_n$, the independent variables, by $R_{1.234 \ldots n}$.

Partial Correlation. The simple correlation between two variates
when the influence of the other variates has been eliminated
from both is called partial correlation. Thus we may measure
the correlation between the heights and weights of boys of the
same age, say 15 years. Here the age factor is the same for all
the boys, and weights and heights are the variable factors. Again we
may measure the correlation between the age at marriage and
the number of children in families of the same income group.
Here the factor income per family is constant. The correlation
between grades in mathematics and grades in English for higher
secondary boys of a Delhi school having the same intelligence
quotient of 90 is also an example of partial correlation. A
classical example is the correlation between the statures of fathers
and sons when the stature of the mother has a particular value,
say 60 inches.

We shall denote the partial correlation between $x_1$ and $x_2$
when $x_3, x_4, \ldots, x_n$ are kept constant by $r_{12.34 \ldots n}$. We
shall also denote the simple correlation between $x_1$ and $x_2$ by $r_{12}$.

12.2. Equation of the Regression Plane. We shall consider
the case of three variables and will derive the equation of the
regression plane of $x_1$ on $x_2$ and $x_3$. We shall assume that the
variables $x_1, x_2, x_3$ are measured from their respective means,

i.e. $X_1 - \bar{X}_1 = x_1$; $X_2 - \bar{X}_2 = x_2$ and $X_3 - \bar{X}_3 = x_3$.

Let the equation of the regression plane of $x_1$ on $x_2$ and $x_3$ be

$x_1 = a + b_{12.3}\,x_2 + b_{13.2}\,x_3$. ...(A)

The above notation for the b's has been used for convenience
only. Denoting the sum of squares of residuals by u, we get

$$u = \sum (x_1 - a - b_{12.3}\,x_2 - b_{13.2}\,x_3)^2,$$

where the summation extends to all sets of values of $x_1$, $x_2$ and $x_3$.

Now we have to find a and the b's in order that u may be
minimum, the conditions for which are

$\dfrac{\partial u}{\partial a} = \sum 2\,(x_1 - a - b_{12.3}\,x_2 - b_{13.2}\,x_3)\,(-1) = 0$, ...(1)

$\dfrac{\partial u}{\partial b_{12.3}} = \sum 2\,(x_1 - a - b_{12.3}\,x_2 - b_{13.2}\,x_3)\,(-x_2) = 0$ ...(2)

and

$\dfrac{\partial u}{\partial b_{13.2}} = \sum 2\,(x_1 - a - b_{12.3}\,x_2 - b_{13.2}\,x_3)\,(-x_3) = 0$. ...(3)

Since $\sum x_1 = \sum x_2 = \sum x_3 = 0$, (1) gives a = 0 and then (2) and
(3) can be written as

$\sum x_1x_2 - b_{12.3}\sum x_2^2 - b_{13.2}\sum x_2x_3 = 0$, ...(4)

$\sum x_1x_3 - b_{12.3}\sum x_2x_3 - b_{13.2}\sum x_3^2 = 0$. ...(5)

Again since $\sum x_ix_j = N r_{ij}\,\sigma_i\sigma_j$ and $\sum x_i^2 = N\sigma_i^2$ for N sets of values,
we may write (4) and (5) as

$N r_{12}\,\sigma_1\sigma_2 = N b_{12.3}\,\sigma_2^2 + N b_{13.2}\,r_{23}\,\sigma_2\sigma_3$

and $N r_{13}\,\sigma_1\sigma_3 = N b_{12.3}\,r_{23}\,\sigma_2\sigma_3 + N b_{13.2}\,\sigma_3^2$

or $r_{12}\,\sigma_1 = b_{12.3}\,\sigma_2 + b_{13.2}\,r_{23}\,\sigma_3$, ...(6)

$r_{13}\,\sigma_1 = b_{12.3}\,r_{23}\,\sigma_2 + b_{13.2}\,\sigma_3$. ...(7)

Solving (6) and (7) for $b_{12.3}$ and $b_{13.2}$, we get

$$b_{12.3} = \frac{\sigma_1}{\sigma_2}\cdot
\frac{\begin{vmatrix} r_{12} & r_{23} \\ r_{13} & 1 \end{vmatrix}}
     {\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}$$

and

$$b_{13.2} = \frac{\sigma_1}{\sigma_3}\cdot
\frac{\begin{vmatrix} 1 & r_{12} \\ r_{23} & r_{13} \end{vmatrix}}
     {\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}.$$


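For convenience and simplicity, we define the determinant $\Delta$ by

$$\Delta = \begin{vmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{vmatrix}.$$

If $\Delta_{ij}$ be the cofactor of $r_{ij}$, then

$$\Delta_{11} = \begin{vmatrix} r_{22} & r_{23} \\ r_{32} & r_{33} \end{vmatrix}
= \begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix},$$

$$\Delta_{12} = -\begin{vmatrix} r_{21} & r_{23} \\ r_{31} & r_{33} \end{vmatrix}
= -\begin{vmatrix} r_{12} & r_{23} \\ r_{13} & 1 \end{vmatrix},$$

$$\Delta_{13} = \begin{vmatrix} r_{21} & r_{22} \\ r_{31} & r_{32} \end{vmatrix}
= \begin{vmatrix} r_{12} & 1 \\ r_{13} & r_{23} \end{vmatrix}
= -\begin{vmatrix} 1 & r_{12} \\ r_{23} & r_{13} \end{vmatrix}$$

[$\because$ $r_{11} = r_{22} = r_{33} = 1$ and $r_{12} = r_{21}$ etc.].

$$\therefore\quad b_{12.3} = -\frac{\sigma_1}{\sigma_2}\cdot\frac{\Delta_{12}}{\Delta_{11}}
\quad\text{and}\quad
b_{13.2} = -\frac{\sigma_1}{\sigma_3}\cdot\frac{\Delta_{13}}{\Delta_{11}}.$$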

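Hence the equation (A) of the regression plane of $x_1$ on $x_2$
and $x_3$ can be written as

$$x_1 = -\frac{\sigma_1}{\sigma_2}\cdot\frac{\Delta_{12}}{\Delta_{11}}\,x_2 - \frac{\sigma_1}{\sigma_3}\cdot\frac{\Delta_{13}}{\Delta_{11}}\,x_3,$$

i.e.

$\dfrac{x_1}{\sigma_1}\,\Delta_{11} + \dfrac{x_2}{\sigma_2}\,\Delta_{12} + \dfrac{x_3}{\sigma_3}\,\Delta_{13} = 0$. ...(B)

Similarly equations for the regression planes of $x_2$ on $x_1$
and $x_3$, and of $x_3$ on $x_1$ and $x_2$, may be written by cyclical
permutation of the subscripts on x, $\sigma$ and $\Delta$.

Referred to an arbitrary origin, (B) becomes

$$\frac{X_1 - \bar{X}_1}{\sigma_1}\,\Delta_{11} + \frac{X_2 - \bar{X}_2}{\sigma_2}\,\Delta_{12} + \frac{X_3 - \bar{X}_3}{\sigma_3}\,\Delta_{13} = 0.$$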
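
The expressions for the partial regression coefficients can be checked against the normal equations (4) and (5). The following Python sketch is illustrative; the three series X1, X2, X3 are hypothetical data invented for the check, and the helper names are assumptions.

```python
from math import sqrt

# Hypothetical observations on three variables.
X1 = [2, 4, 5, 7, 9, 10]
X2 = [1, 2, 4, 5, 6, 8]
X3 = [3, 1, 4, 2, 6, 5]


def deviations(values):
    """Measure a variable from its mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


x1, x2, x3 = deviations(X1), deviations(X2), deviations(X3)
N = len(x1)


def sd(u):
    return sqrt(sum(a * a for a in u) / N)


def corr(u, v):
    return sum(a * b for a, b in zip(u, v)) / (N * sd(u) * sd(v))


s1, s2, s3 = sd(x1), sd(x2), sd(x3)
r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)

# Partial regression coefficients from the determinant formulae.
b12_3 = (s1 / s2) * (r12 - r13 * r23) / (1 - r23 ** 2)
b13_2 = (s1 / s3) * (r13 - r12 * r23) / (1 - r23 ** 2)

# Left-hand sides of the normal equations (4) and (5); both should vanish.
eq4 = (sum(a * b for a, b in zip(x1, x2))
       - b12_3 * sum(b * b for b in x2)
       - b13_2 * sum(b * c for b, c in zip(x2, x3)))
eq5 = (sum(a * c for a, c in zip(x1, x3))
       - b12_3 * sum(b * c for b, c in zip(x2, x3))
       - b13_2 * sum(c * c for c in x3))

print(round(b12_3, 4), round(b13_2, 4))
print(abs(eq4) < 1e-9, abs(eq5) < 1e-9)
```

Both normal equations are satisfied to within rounding error, so the coefficients obtained from the correlations are the least-squares values.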




A note on notations. For the sake of brevity, we shall denote
the residual $x_1 - b_{12.3}\,x_2 - b_{13.2}\,x_3$ by the symbol $x_{1.23}$. We call
$x_{1.23}$ the residual of the second order, the order of the residual
being the number of subscripts after the point. The quantities
$b_{12.3}$ and $b_{13.2}$ are called the partial regression coefficients of $x_1$ on
$x_2$ for fixed $x_3$ and of $x_1$ on $x_3$ for fixed $x_2$ respectively. The first
subscript attached to the b's is the subscript of the letter on the
left (the dependent variable) and the second subscript is that of
the x to which it is attached. We call these subscripts primary
subscripts, and the subscripts separated from the primary subscripts by
a point are called the secondary subscripts. A standard deviation
denoted by a symbol with p secondary subscripts may be called a
standard deviation of the pth order. Thus $\sigma_1$, $\sigma_2$, etc. are the
standard deviations of order zero, $\sigma_{1.2}$, $\sigma_{1.3}$, etc. are the
standard deviations of the first order, and so on.

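12.3. Multiple correlation coefficient. The multiple correlation
coefficient of $x_1$ on $x_2$ and $x_3$ is the simple correlation
coefficient between the observed value of $x_1$ and its estimated
value $b_{12.3}\,x_2 + b_{13.2}\,x_3$, which we shall denote by $e_{1.23}$.

Denoting the multiple correlation of $x_1$ on $x_2$ and $x_3$ by
$R_{1(23)}$, we have

$R_{1(23)} = \dfrac{\operatorname{cov}(x_1, e_{1.23})}{\sqrt{[\operatorname{var}(x_1)]}\,\sqrt{[\operatorname{var}(e_{1.23})]}}$. ...(1)

Now the standard deviation of $x_1$ is $\sigma_1$.

To find the variance of $e_{1.23}$, we proceed as follows.

Mean of $e_{1.23} = b_{12.3}\cdot\dfrac{1}{N}\sum x_2 + b_{13.2}\cdot\dfrac{1}{N}\sum x_3 = b_{12.3}\,\bar{x}_2 + b_{13.2}\,\bar{x}_3$.

$\therefore\ \operatorname{var}(e_{1.23}) = \dfrac{1}{N}\sum (e_{1.23} - b_{12.3}\,\bar{x}_2 - b_{13.2}\,\bar{x}_3)^2$

$= \dfrac{1}{N}\sum (b_{12.3}\,x_2 + b_{13.2}\,x_3 - b_{12.3}\,\bar{x}_2 - b_{13.2}\,\bar{x}_3)^2$

$= \dfrac{1}{N}\sum \{b_{12.3}\,(x_2 - \bar{x}_2) + b_{13.2}\,(x_3 - \bar{x}_3)\}^2$

$= b_{12.3}^2\cdot\dfrac{1}{N}\sum (x_2 - \bar{x}_2)^2 + 2\,b_{12.3}\,b_{13.2}\cdot\dfrac{1}{N}\sum (x_2 - \bar{x}_2)(x_3 - \bar{x}_3) + b_{13.2}^2\cdot\dfrac{1}{N}\sum (x_3 - \bar{x}_3)^2$

$= b_{12.3}^2\,\sigma_2^2 + 2\,b_{12.3}\,b_{13.2}\,r_{23}\,\sigma_2\sigma_3 + b_{13.2}^2\,\sigma_3^2$

$= \sigma_1^2\cdot\dfrac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{23}^2}$, ...(2)

on substituting the values of $b_{12.3}$ and $b_{13.2}$ in terms of simple
correlations and simplifying the result.

Again $\operatorname{cov}(x_1, e_{1.23}) = \dfrac{1}{N}\sum (x_1 - \bar{x}_1)(e_{1.23} - \bar{e}_{1.23})$

$= \dfrac{1}{N}\sum (x_1 - \bar{x}_1)\,\{b_{12.3}\,(x_2 - \bar{x}_2) + b_{13.2}\,(x_3 - \bar{x}_3)\}$

$= b_{12.3}\,r_{12}\,\sigma_1\sigma_2 + b_{13.2}\,r_{13}\,\sigma_1\sigma_3$

$= \sigma_1^2\cdot\dfrac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{23}^2}$,

on substituting the values of $b_{12.3}$ and $b_{13.2}$.

Hence substituting in (1), we get

$R_{1(23)}^2 = \dfrac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{23}^2}$ ...(3)

or $R_{1(23)} = \left(1 - \dfrac{\Delta}{\Delta_{11}}\right)^{1/2}$. ...(4)

Generally, it can be shown that

$$R_{1(23 \ldots n)} = \left(1 - \frac{\Delta}{\Delta_{11}}\right)^{1/2},$$

where

$$\Delta = \begin{vmatrix} r_{11} & r_{12} & \ldots & r_{1n} \\ r_{21} & r_{22} & \ldots & r_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ r_{n1} & r_{n2} & \ldots & r_{nn} \end{vmatrix}$$

and $\Delta_{11}$ is the cofactor of $r_{11}$ in $\Delta$.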

Since $R_{1(23\ldots n)}$ is a simple correlation coefficient between $x_1$
and $e_{1.23\ldots n}$, it must lie between $-1$ and $1$. But we have seen
above that the covariance between $x_1$ and $e_{1.23\ldots n}$ is the same as
the variance of $e_{1.23\ldots n}$ and hence must be a non-negative
quantity. Thus we always have

\[ 0 \le R_{1(23\ldots n)} \le 1. \]

Clearly the multiple correlation is perfect if $\Delta = 0$.
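
The two forms of the trivariate coefficient can be compared numerically. The sketch below is only an illustration: the function names are hypothetical, and the correlations used are those of Exercise 1 at the end of this chapter ($r_{12}=0.8$, $r_{13}=-0.7$, $r_{23}=-0.9$).

```python
import math

def multiple_correlation(r12, r13, r23):
    """R_1(23) from the three total correlations (formula in terms of r's)."""
    r_sq = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)
    return math.sqrt(r_sq)

def det3(r12, r13, r23):
    """Determinant of the 3x3 correlation matrix (the quantity Delta)."""
    return 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23

def multiple_correlation_det(r12, r13, r23):
    """Same coefficient via sqrt(1 - Delta / Delta_11), with Delta_11 = 1 - r23^2."""
    delta_11 = 1 - r23**2
    return math.sqrt(1 - det3(r12, r13, r23) / delta_11)

# Illustrative values taken from Exercise 1 of this chapter.
R_a = multiple_correlation(0.8, -0.7, -0.9)
R_b = multiple_correlation_det(0.8, -0.7, -0.9)
print(round(R_a, 3), round(R_b, 3))
```

Both routes give the same value, which rounds to the answer 0.80 quoted for that exercise.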

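12.4. Partial correlation coefficient. As already defined, the
partial correlation between $x_1$ and $x_2$ in the trivariate distribution
of $x_1$, $x_2$, $x_3$ is the simple correlation between $x_1$ and $x_2$
when the influence of $x_3$ is eliminated. Hence we subtract from
the $x_1$ of each point that part of $x_1$ which is due to the influence
of $x_3$, as indicated by the regression of $x_1$ on $x_3$, and denoting
the residual by $x_{1.3}$, we get

\[ x_{1.3} = x_1 - r_{13}\,\frac{\sigma_1}{\sigma_3}\,x_3. \]

Similarly,

\[ x_{2.3} = x_2 - r_{23}\,\frac{\sigma_2}{\sigma_3}\,x_3. \]

The partial correlation coefficient of $x_1$ and $x_2$ in the trivariate
distribution is defined as the ordinary correlation coefficient
of $x_{1.3}$ and $x_{2.3}$ and may be denoted by $r_{12.3}$.

\[ \text{Thus}\quad r_{12.3} = \frac{\operatorname{cov}(x_{1.3},\, x_{2.3})}{\sqrt{\operatorname{var}(x_{1.3})}\,\sqrt{\operatorname{var}(x_{2.3})}}. \qquad \ldots(1) \]

Now

\begin{align*}
\operatorname{cov}(x_{1.3}, x_{2.3}) &= \frac{1}{N}\sum (x_{1.3}-\bar{x}_{1.3})(x_{2.3}-\bar{x}_{2.3}) \\
&= \frac{1}{N}\sum \left\{ \left( x_1 - r_{13}\frac{\sigma_1}{\sigma_3}x_3 - \bar{x}_1 + r_{13}\frac{\sigma_1}{\sigma_3}\bar{x}_3 \right) \left( x_2 - r_{23}\frac{\sigma_2}{\sigma_3}x_3 - \bar{x}_2 + r_{23}\frac{\sigma_2}{\sigma_3}\bar{x}_3 \right) \right\} \\
&= \frac{1}{N}\sum \left\{ (x_1-\bar{x}_1) - r_{13}\frac{\sigma_1}{\sigma_3}(x_3-\bar{x}_3) \right\} \left\{ (x_2-\bar{x}_2) - r_{23}\frac{\sigma_2}{\sigma_3}(x_3-\bar{x}_3) \right\} \\
&= \frac{1}{N}\sum (x_1-\bar{x}_1)(x_2-\bar{x}_2) - r_{23}\frac{\sigma_2}{\sigma_3}\,\frac{1}{N}\sum (x_1-\bar{x}_1)(x_3-\bar{x}_3) \\
&\qquad - r_{13}\frac{\sigma_1}{\sigma_3}\,\frac{1}{N}\sum (x_2-\bar{x}_2)(x_3-\bar{x}_3) + r_{13}r_{23}\frac{\sigma_1\sigma_2}{\sigma_3^2}\,\frac{1}{N}\sum (x_3-\bar{x}_3)^2 \\
&= r_{12}\sigma_1\sigma_2 - r_{23}\frac{\sigma_2}{\sigma_3}\,r_{13}\sigma_1\sigma_3 - r_{13}\frac{\sigma_1}{\sigma_3}\,r_{23}\sigma_2\sigma_3 + r_{13}r_{23}\frac{\sigma_1\sigma_2}{\sigma_3^2}\,\sigma_3^2 \\
&= \sigma_1\sigma_2\,(r_{12} - r_{13}r_{23}),
\end{align*}

\begin{align*}
\operatorname{var}(x_{1.3}) &= \frac{1}{N}\sum (x_{1.3}-\bar{x}_{1.3})^2 \\
&= \frac{1}{N}\sum \left\{ (x_1-\bar{x}_1) - r_{13}\frac{\sigma_1}{\sigma_3}(x_3-\bar{x}_3) \right\}^2 \\
&= \frac{1}{N}\sum (x_1-\bar{x}_1)^2 - 2r_{13}\frac{\sigma_1}{\sigma_3}\,\frac{1}{N}\sum (x_1-\bar{x}_1)(x_3-\bar{x}_3) + r_{13}^2\frac{\sigma_1^2}{\sigma_3^2}\,\frac{1}{N}\sum (x_3-\bar{x}_3)^2 \\
&= \sigma_1^2 - 2r_{13}\frac{\sigma_1}{\sigma_3}\,r_{13}\sigma_1\sigma_3 + r_{13}^2\frac{\sigma_1^2}{\sigma_3^2}\,\sigma_3^2 \\
&= \sigma_1^2\,(1 - r_{13}^2).
\end{align*}

Similarly $\operatorname{var}(x_{2.3}) = \sigma_2^2\,(1 - r_{23}^2)$.

Hence

\[ r_{12.3} = \frac{\sigma_1\sigma_2\,(r_{12} - r_{13}r_{23})}{\sigma_1\sqrt{1-r_{13}^2}\;\sigma_2\sqrt{1-r_{23}^2}} = \frac{r_{12} - r_{13}r_{23}}{\left[(1-r_{13}^2)(1-r_{23}^2)\right]^{1/2}} = -\frac{\Delta_{12}}{\left[\Delta_{11}\Delta_{22}\right]^{1/2}}, \qquad \ldots(2) \]

where $\Delta_{11} = 1 - r_{23}^2$, $\Delta_{22} = 1 - r_{13}^2$ and $\Delta_{12} = -(r_{12} - r_{13}r_{23})$ are the
cofactors of $r_{11}$, $r_{22}$ and $r_{12}$ in $\Delta$.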
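
The definition of $r_{12.3}$ as a correlation between residuals and the closed formula in terms of total correlations can be checked against each other on data. The following is a minimal sketch under stated assumptions: the data are artificially generated for illustration only, and the helper names `corr`, `residual` and `partial_corr` are hypothetical.

```python
import math
import random

def corr(u, v):
    """Ordinary (product-moment) correlation coefficient of two lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def residual(u, w):
    """Residual of u after removing its simple linear regression on w."""
    n = len(u)
    mu, mw = sum(u) / n, sum(w) / n
    b = sum((a - mu) * (c - mw) for a, c in zip(u, w)) / sum((c - mw) ** 2 for c in w)
    return [(a - mu) - b * (c - mw) for a, c in zip(u, w)]

def partial_corr(r12, r13, r23):
    """r_12.3 from the three total correlations."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))

# Artificial data (an assumption made purely for illustration).
rng = random.Random(0)
x3 = [rng.gauss(0, 1) for _ in range(200)]
x1 = [0.6 * c + rng.gauss(0, 1) for c in x3]
x2 = [-0.4 * c + 0.5 * a + rng.gauss(0, 1) for a, c in zip(x1, x3)]

by_formula = partial_corr(corr(x1, x2), corr(x1, x3), corr(x2, x3))
by_residuals = corr(residual(x1, x3), residual(x2, x3))
print(by_formula, by_residuals)
```

The two numbers agree up to rounding error, because the formula is an algebraic identity between sample moments and not an approximation.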

12.5. Solved Examples.

1. With usual notations, prove that

(i) $\sum x_2 x_{1.23} = \sum x_3 x_{1.23} = 0$,

$\sum x_3 x_{2.13} = \sum x_1 x_{2.13} = 0$,

and $\sum x_1 x_{3.12} = \sum x_2 x_{3.12} = 0$;

(ii) $\sum x_{1.23}\, x_{1.2} = \sum x_{1.23}\, x_1$,

$\sum x_{1.23}\, x_{1.23} = \sum x_{1.23}\, x_1$,

where $x_{1.2} = x_1 - b_{12}x_2$;

(iii) $\sum x_{3.2}\, x_{1.23} = \sum x_{2.3}\, x_{1.23} = 0$.

(i) The equation of the regression plane of $x_1$ on $x_2$ and $x_3$
is $x_1 = b_{12.3}x_2 + b_{13.2}x_3$.

The normal equations for finding the $b$'s are

\[ \sum x_2\,(x_1 - b_{12.3}x_2 - b_{13.2}x_3) = \sum x_2 x_{1.23} = 0 \]

\[ \text{and}\quad \sum x_3\,(x_1 - b_{12.3}x_2 - b_{13.2}x_3) = \sum x_3 x_{1.23} = 0. \]

Similarly the other two results can be at once written by
writing down the normal equations for the regression planes of
$x_2$ on $x_3$ and $x_1$ and of $x_3$ on $x_1$ and $x_2$ respectively.

These results show that the sum of the products of corresponding
values of a variate and a residual is zero provided the subscript of the
variate occurs among the secondary subscripts of the residual.

(ii) We have

\begin{align*}
\sum x_{1.23}\, x_{1.2} &= \sum x_{1.23}\,(x_1 - b_{12}x_2) \\
&= \sum x_1 x_{1.23} - b_{12}\sum x_2 x_{1.23} \\
&= \sum x_1 x_{1.23} \quad \text{from (i)},
\end{align*}

and

\begin{align*}
\sum x_{1.23}\, x_{1.23} &= \sum x_{1.23}\,(x_1 - b_{12.3}x_2 - b_{13.2}x_3) \\
&= \sum x_1 x_{1.23} - b_{12.3}\sum x_2 x_{1.23} - b_{13.2}\sum x_3 x_{1.23} \\
&= \sum x_1 x_{1.23} \quad \text{from (i)}.
\end{align*}

It follows that the sum of the products of two residuals is
unaltered by omitting from one residual any or all of the secondary
subscripts which are common to both.

(iii) We have

\begin{align*}
\sum x_{3.2}\, x_{1.23} &= \sum (x_3 - b_{32}x_2)\, x_{1.23} \\
&= \sum x_3 x_{1.23} - b_{32}\sum x_2 x_{1.23} = 0 \quad \text{from (i)}.
\end{align*}

Similarly, $\sum x_{2.3}\, x_{1.23} = 0$.

Hence the sum of the products of residuals is zero, provided
all the subscripts of one residual occur among the secondary
subscripts of the second.

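2. Prove the following results:

(i) $\operatorname{cov}(x_2, x_{1.23}) = \operatorname{cov}(x_3, x_{1.23}) = 0$,

(ii) $\operatorname{cov}(e_{1.23}, x_{1.23}) = 0$,

(iii) $\operatorname{var}(x_{1.23}) = \sigma_1^2\left[1 - R^2_{1(23)}\right] = \sigma_1^2\,\dfrac{\Delta}{\Delta_{11}}$.

Since $x_1$, $x_2$, $x_3$ are the deviations from their respective means,
we get $\bar{x}_1 = \bar{x}_2 = \bar{x}_3 = 0$,

\[ \bar{e}_{1.23} = \bar{e}_{2.31} = \bar{e}_{3.12} = 0 \]

\[ \text{and}\quad \bar{x}_{1.23} = \bar{x}_{2.31} = \bar{x}_{3.12} = 0. \]

We then have

\begin{align*}
\text{(i)}\quad \operatorname{cov}(x_2, x_{1.23}) &= E\,(x_2\, x_{1.23}) \\
&= \frac{1}{N}\sum x_2\,(x_1 - b_{12.3}x_2 - b_{13.2}x_3) \\
&= \frac{1}{N}\sum x_1x_2 - b_{12.3}\,\frac{1}{N}\sum x_2^2 - b_{13.2}\,\frac{1}{N}\sum x_2x_3 \\
&= r_{12}\sigma_1\sigma_2 - b_{12.3}\,\sigma_2^2 - b_{13.2}\,r_{23}\sigma_2\sigma_3 \\
&= r_{12}\sigma_1\sigma_2 - \frac{\sigma_1}{\sigma_2}\,\frac{r_{12} - r_{13}r_{23}}{1 - r_{23}^2}\,\sigma_2^2 - \frac{\sigma_1}{\sigma_3}\,\frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2}\,r_{23}\sigma_2\sigma_3 \\
&= \frac{\sigma_1\sigma_2}{1 - r_{23}^2}\left[ r_{12} - r_{12}r_{23}^2 - r_{12} + r_{13}r_{23} - r_{13}r_{23} + r_{12}r_{23}^2 \right] \\
&= 0.
\end{align*}

Similarly, $\operatorname{cov}(x_3, x_{1.23}) = 0$.

\begin{align*}
\text{(ii)}\quad \operatorname{cov}(e_{1.23}, x_{1.23}) &= E\,(e_{1.23}\, x_{1.23}) \\
&= \frac{1}{N}\sum \left\{ (b_{12.3}x_2 + b_{13.2}x_3)\, x_{1.23} \right\} \\
&= \frac{b_{12.3}}{N}\sum x_2 x_{1.23} + \frac{b_{13.2}}{N}\sum x_3 x_{1.23} \\
&= 0 \quad \text{from (i)}.
\end{align*}

(iii) From (ii) it follows that $e_{1.23}$ and $x_{1.23}$ are uncorrelated.
Since $x_1 = e_{1.23} + x_{1.23}$, it follows that

\[ \operatorname{var}(x_1) = \operatorname{var}(x_{1.23}) + \operatorname{var}(e_{1.23}) \]

or

\begin{align*}
\operatorname{var}(x_{1.23}) &= \operatorname{var}(x_1) - \operatorname{var}(e_{1.23}) \\
&= \sigma_1^2 - \sigma_1^2\,\frac{r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23}}{1 - r_{23}^2} && \text{[see (2) of § 12.3]} \\
&= \sigma_1^2\left[1 - R^2_{1(23)}\right] && \text{[see (3) of § 12.3]} \\
&= \sigma_1^2\,\frac{\Delta}{\Delta_{11}} && \text{from (4) of § 12.3.}
\end{align*}

It follows from above that the standard error of the estimate
is $\sigma_1\sqrt{1 - R^2_{1(23)}}$ and may be denoted by $\sigma_{1.23}$. Clearly, the larger
the value of the multiple correlation, the smaller is the standard
error of the estimate. If $R_{1(23)} = 1$, $\sigma_{1.23} = 0$ and so in this case the
observed and predicted values coincide.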

3. Define multiple and partial correlation coefficients.

For a trivariate distribution, show that

\[ 1 - R^2_{1(23)} = (1 - r_{12}^2)\,(1 - r^2_{13.2}). \]

Deduce that $R_{1(23)} \ge r_{12}$.

For definitions, see § 12.3 and § 12.4.

From § 12.3, we have

\[ 1 - R^2_{1(23)} = \frac{\Delta}{\Delta_{11}}. \qquad \ldots(1) \]

And from § 12.4, we get

\[ 1 - r^2_{13.2} = 1 - \frac{\Delta_{13}^2}{\Delta_{11}\Delta_{33}}. \qquad \ldots(2) \]

Dividing (1) by (2), we get

\begin{align*}
\frac{1 - R^2_{1(23)}}{1 - r^2_{13.2}} &= \frac{\Delta\,\Delta_{33}}{\Delta_{11}\Delta_{33} - \Delta_{13}^2} \\
&= \frac{(1 - r_{12}^2)\left[ (1 - r_{23}^2) - r_{12}(r_{12} - r_{23}r_{13}) + r_{13}(r_{12}r_{23} - r_{13}) \right]}{(1 - r_{23}^2)(1 - r_{12}^2) - (r_{12}r_{23} - r_{13})^2} \\
&= \frac{(1 - r_{12}^2)\left[ 1 - r_{23}^2 - r_{12}^2 + r_{12}r_{23}r_{13} + r_{12}r_{23}r_{13} - r_{13}^2 \right]}{1 - r_{23}^2 - r_{12}^2 + r_{12}^2r_{23}^2 - r_{12}^2r_{23}^2 - r_{13}^2 + 2r_{12}r_{23}r_{13}} \\
&= 1 - r_{12}^2.
\end{align*}

\[ \text{Hence}\quad 1 - R^2_{1(23)} = (1 - r^2_{13.2})\,(1 - r_{12}^2). \qquad \ldots(3) \]

This proves the first result.

Again, since $0 \le r_{12}^2 \le 1$, we have

\[ 0 \le 1 - r_{12}^2 \le 1. \]

Similarly $0 \le 1 - r^2_{13.2} \le 1$.

Then it follows from (3) that

\[ 1 - R^2_{1(23)} \le 1 - r_{12}^2, \quad \text{i.e.}\quad R_{1(23)} \ge |r_{12}| \ge r_{12}, \qquad \ldots(4) \]

\[ \text{and}\quad 1 - R^2_{1(23)} \le 1 - r^2_{13.2}, \quad \text{i.e.}\quad R_{1(23)} \ge |r_{13.2}|. \qquad \ldots(5) \]

The results (4) and (5) show that the multiple correlation
coefficient can never be smaller than the numerical value of any
total correlation or partial correlation coefficient.

Note. Since $\sigma_1^2\,(1 - r_{12}^2)$ is the residual variance when $x_1$ is
estimated from $x_2$ and $\sigma_1^2\,(1 - R^2_{1(23)})$ is the residual variance
when $x_1$ is estimated from $x_2$ and $x_3$ taken together, it follows
that as a result of including an additional variable $x_3$, the residual
variance is reduced. Now it is worth while to include $x_3$ only
when the reduction in the residual variance is substantial. This
will be so if the numerical value of $r_{13.2}$ is sufficiently large. This
shows the importance of partial correlation in deciding whether
an additional independent variable is to be included in regression
analysis.
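
The identity of this example, and the remark in the Note that the proportional reduction in residual variance equals $r^2_{13.2}$, can be checked numerically. The sketch below is illustrative only; the function name and the three sets of correlations are arbitrary choices made for the check.

```python
import math

def check_identity(r12, r13, r23):
    """Return (lhs, rhs, r13_2) for 1 - R^2 = (1 - r12^2)(1 - r13.2^2)."""
    R_sq = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)
    r13_2 = (r13 - r12 * r23) / math.sqrt((1 - r12**2) * (1 - r23**2))
    lhs = 1 - R_sq
    rhs = (1 - r12**2) * (1 - r13_2**2)
    return lhs, rhs, r13_2

# Arbitrary illustrative correlation triples (r12, r13, r23).
triples = [(0.5, 0.5, 0.5), (0.80, -0.40, -0.56), (0.6, 0.3, -0.2)]
results = [check_identity(*t) for t in triples]

for (r12, _, _), (lhs, rhs, r13_2) in zip(triples, results):
    # Fraction of the residual variance sigma_1^2 (1 - r12^2) removed by adding x3.
    reduction = 1 - lhs / (1 - r12**2)
    print(round(lhs, 6), round(rhs, 6), round(reduction, 6), round(r13_2**2, 6))
```

In every row the first two columns agree, and so do the last two, which is the content of the Note: adding $x_3$ is worth while only when $r^2_{13.2}$ is not small.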

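4. The necessary and sufficient condition that the three
regression planes in the case of a trivariate distribution coincide is

\[ r_{23}^2 + r_{31}^2 + r_{12}^2 - 2r_{23}r_{31}r_{12} = 1. \]

The equations to the three regression planes are

\[ \frac{x_1}{\sigma_1}\Delta_{11} + \frac{x_2}{\sigma_2}\Delta_{12} + \frac{x_3}{\sigma_3}\Delta_{13} = 0, \qquad \ldots(1) \]

\[ \frac{x_1}{\sigma_1}\Delta_{21} + \frac{x_2}{\sigma_2}\Delta_{22} + \frac{x_3}{\sigma_3}\Delta_{23} = 0, \qquad \ldots(2) \]

\[ \text{and}\quad \frac{x_1}{\sigma_1}\Delta_{31} + \frac{x_2}{\sigma_2}\Delta_{32} + \frac{x_3}{\sigma_3}\Delta_{33} = 0. \qquad \ldots(3) \]

Now the planes (1) and (2) will coincide if and only if the
coefficients are proportional, i.e. if

\[ \frac{\Delta_{11}}{\Delta_{21}} = \frac{\Delta_{12}}{\Delta_{22}} = \frac{\Delta_{13}}{\Delta_{23}}. \qquad \ldots(4) \]

Similarly the planes (2) and (3) will coincide if

\[ \frac{\Delta_{21}}{\Delta_{31}} = \frac{\Delta_{22}}{\Delta_{32}} = \frac{\Delta_{23}}{\Delta_{33}}. \qquad \ldots(5) \]

From $\dfrac{\Delta_{11}}{\Delta_{21}} = \dfrac{\Delta_{12}}{\Delta_{22}}$ we get

\begin{align*}
& \Delta_{11}\Delta_{22} - \Delta_{21}\Delta_{12} = 0 \\
\text{or}\quad & (1 - r_{23}^2)(1 - r_{31}^2) - (r_{12} - r_{23}r_{31})^2 = 0 \\
\text{or}\quad & 1 - r_{23}^2 - r_{31}^2 + r_{23}^2r_{31}^2 - r_{12}^2 - r_{23}^2r_{31}^2 + 2r_{23}r_{31}r_{12} = 0 \\
\text{or}\quad & r_{23}^2 + r_{31}^2 + r_{12}^2 - 2r_{23}r_{31}r_{12} = 1,
\end{align*}

and from $\dfrac{\Delta_{12}}{\Delta_{22}} = \dfrac{\Delta_{13}}{\Delta_{23}}$ we get

\begin{align*}
& \Delta_{12}\Delta_{23} - \Delta_{13}\Delta_{22} = 0 \\
\text{or}\quad & (r_{12} - r_{23}r_{31})(r_{23} - r_{31}r_{12}) + (r_{31} - r_{23}r_{12})(1 - r_{31}^2) = 0, \\
\text{i.e.}\quad & r_{12}r_{23} - r_{12}^2r_{31} - r_{23}^2r_{31} + r_{12}r_{23}r_{31}^2 + r_{31} - r_{31}^3 - r_{12}r_{23} + r_{12}r_{23}r_{31}^2 = 0, \\
\text{i.e.}\quad & r_{31}\,(r_{23}^2 + r_{31}^2 + r_{12}^2 - 2r_{23}r_{31}r_{12} - 1) = 0 \\
\text{or}\quad & r_{23}^2 + r_{31}^2 + r_{12}^2 - 2r_{23}r_{31}r_{12} = 1,
\end{align*}

which is the same as before.

Similarly from any pair of (5), the same result will follow.
Hence the result.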

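5. (a) If $r_{23} = 0$, prove that $R^2_{1(23)} = r_{12}^2 + r_{13}^2$

and $\sigma^2_{1.23} = \sigma_1^2\,(1 - r_{12}^2 - r_{13}^2)$.

(b) If $r_{23} = 1$, prove that

\[ R^2_{1(23)} = r_{12}^2 = r_{13}^2 \]

and $\sigma^2_{1.23} = \sigma_1^2\,(1 - r_{12}^2)$.

(a) Now $R^2_{1(23)} = 1 - \dfrac{\Delta}{\Delta_{11}}$

and when $r_{23} = 0$, we have

\[ \Delta = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & 0 \\ r_{31} & 0 & 1 \end{vmatrix} = 1 - r_{12}^2 - r_{31}^2 \]

and $\Delta_{11} = 1 - r_{23}^2 = 1$.

\[ \therefore\quad R^2_{1(23)} = 1 - (1 - r_{12}^2 - r_{31}^2) = r_{12}^2 + r_{31}^2 \]

\[ \text{and}\quad \sigma^2_{1.23} = \sigma_1^2\left[1 - R^2_{1(23)}\right] = \sigma_1^2\,(1 - r_{12}^2 - r_{31}^2). \]

(b) We have from (3) of § 12.3,

\[ R^2_{1(23)}\,(1 - r_{23}^2) = r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23}. \qquad \ldots(1) \]

Putting $r_{23} = 1$ in the above equation, we get

\[ (r_{12} - r_{13})^2 = 0, \quad \text{i.e.}\quad r_{12} = r_{13}. \]

Also when $r_{12} = r_{13}$, we get

\[ R^2_{1(23)} = \frac{2r_{12}^2 - 2r_{12}^2r_{23}}{1 - r_{23}^2} = \frac{2r_{12}^2}{1 + r_{23}}. \qquad \ldots(2) \]

Hence if we now put $r_{23} = 1$, (2) gives

\[ R^2_{1(23)} = r_{12}^2, \qquad \ldots(3) \]

and since $r_{12} = r_{13}$, $R^2_{1(23)} = r_{12}^2 = r_{13}^2$.

And from Q. 2 (iii), we have

\[ \operatorname{var}(x_{1.23}) = \sigma_1^2\left[1 - R^2_{1(23)}\right], \]

i.e. $\sigma^2_{1.23} = \sigma_1^2\left[1 - r_{12}^2\right]$ from (3).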




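6. If $x_1$, $x_2$ and $x_3$ satisfy the relation $a_1x_1 + a_2x_2 + a_3x_3 = k$,
prove that

\[ r_{12} = \frac{a_3^2\sigma_3^2 - a_1^2\sigma_1^2 - a_2^2\sigma_2^2}{2a_1a_2\sigma_1\sigma_2} \]

with two similar expressions for $r_{23}$ and $r_{31}$.

Also prove that all the partial correlation coefficients are equal to $-1$,
provided that $a_1$, $a_2$, $a_3$ are positive.

Since $x_1$, $x_2$ and $x_3$ are the deviations from the mean, we have
$\sum x_1 = \sum x_2 = \sum x_3 = 0$, so that

\[ \sum k = a_1\sum x_1 + a_2\sum x_2 + a_3\sum x_3 = 0, \quad \text{i.e.}\quad k = 0. \]

Hence the given relation becomes

\[ a_1x_1 + a_2x_2 + a_3x_3 = 0. \qquad \ldots(1) \]

\[ \text{Now}\quad r_{12} = \frac{\sum x_1x_2}{N\sigma_1\sigma_2}. \qquad \ldots(2) \]

From (1), $a_1x_1 + a_2x_2 = -a_3x_3$.

Squaring, $a_1^2x_1^2 + a_2^2x_2^2 + 2a_1a_2x_1x_2 = a_3^2x_3^2$.

\begin{align*}
\text{Hence}\quad & \frac{1}{N}a_1^2\sum x_1^2 + \frac{1}{N}a_2^2\sum x_2^2 + \frac{2}{N}a_1a_2\sum x_1x_2 = \frac{1}{N}a_3^2\sum x_3^2 \\
\text{or}\quad & a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + \frac{2}{N}a_1a_2\sum x_1x_2 = a_3^2\sigma_3^2 \\
\text{or}\quad & \frac{1}{N}\sum x_1x_2 = \frac{a_3^2\sigma_3^2 - a_1^2\sigma_1^2 - a_2^2\sigma_2^2}{2a_1a_2}.
\end{align*}

Substituting in (2), we get

\[ r_{12} = \frac{a_3^2\sigma_3^2 - a_1^2\sigma_1^2 - a_2^2\sigma_2^2}{2a_1a_2\sigma_1\sigma_2}. \]

Also by symmetry, we have

\[ r_{23} = \frac{a_1^2\sigma_1^2 - a_2^2\sigma_2^2 - a_3^2\sigma_3^2}{2a_2a_3\sigma_2\sigma_3}, \qquad r_{31} = \frac{a_2^2\sigma_2^2 - a_3^2\sigma_3^2 - a_1^2\sigma_1^2}{2a_3a_1\sigma_3\sigma_1}. \qquad \ldots(3) \]

Again

\begin{align*}
r_{12.3} &= \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \\
&= \frac{2a_3^2\sigma_3^2\,(a_3^2\sigma_3^2 - a_1^2\sigma_1^2 - a_2^2\sigma_2^2) - (a_2^2\sigma_2^2 - a_3^2\sigma_3^2 - a_1^2\sigma_1^2)(a_1^2\sigma_1^2 - a_2^2\sigma_2^2 - a_3^2\sigma_3^2)}{\sqrt{\left\{4a_3^2a_1^2\sigma_3^2\sigma_1^2 - (a_2^2\sigma_2^2 - a_3^2\sigma_3^2 - a_1^2\sigma_1^2)^2\right\}\left\{4a_2^2a_3^2\sigma_2^2\sigma_3^2 - (a_1^2\sigma_1^2 - a_2^2\sigma_2^2 - a_3^2\sigma_3^2)^2\right\}}},
\end{align*}

on substituting from (3) and multiplying numerator and denominator by
$4a_1a_2a_3^2\sigma_1\sigma_2\sigma_3^2$, which is positive since $a_1$, $a_2$, $a_3$ are positive.

Numerator

\begin{align*}
&= a_1^4\sigma_1^4 + a_2^4\sigma_2^4 + a_3^4\sigma_3^4 - 2a_1^2a_2^2\sigma_1^2\sigma_2^2 - 2a_2^2a_3^2\sigma_2^2\sigma_3^2 - 2a_3^2a_1^2\sigma_3^2\sigma_1^2 \\
&= (a_1^2\sigma_1^2 + a_2^2\sigma_2^2 - a_3^2\sigma_3^2)^2 - 4a_1^2a_2^2\sigma_1^2\sigma_2^2 \\
&= (a_1^2\sigma_1^2 + a_2^2\sigma_2^2 - a_3^2\sigma_3^2 - 2a_1a_2\sigma_1\sigma_2)(a_1^2\sigma_1^2 + a_2^2\sigma_2^2 - a_3^2\sigma_3^2 + 2a_1a_2\sigma_1\sigma_2) \\
&= \left[(a_1\sigma_1 - a_2\sigma_2)^2 - a_3^2\sigma_3^2\right]\left[(a_1\sigma_1 + a_2\sigma_2)^2 - a_3^2\sigma_3^2\right] \\
&= (a_1\sigma_1 - a_2\sigma_2 + a_3\sigma_3)(a_1\sigma_1 - a_2\sigma_2 - a_3\sigma_3)(a_1\sigma_1 + a_2\sigma_2 + a_3\sigma_3)(a_1\sigma_1 + a_2\sigma_2 - a_3\sigma_3).
\end{align*}

Denominator

\begin{align*}
&= \sqrt{\begin{aligned} &(2a_3a_1\sigma_3\sigma_1 + a_2^2\sigma_2^2 - a_3^2\sigma_3^2 - a_1^2\sigma_1^2)(2a_3a_1\sigma_3\sigma_1 - a_2^2\sigma_2^2 + a_3^2\sigma_3^2 + a_1^2\sigma_1^2) \\ &\times (2a_2a_3\sigma_2\sigma_3 + a_1^2\sigma_1^2 - a_2^2\sigma_2^2 - a_3^2\sigma_3^2)(2a_2a_3\sigma_2\sigma_3 - a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + a_3^2\sigma_3^2) \end{aligned}} \\
&= \sqrt{\left\{a_2^2\sigma_2^2 - (a_3\sigma_3 - a_1\sigma_1)^2\right\}\left\{(a_3\sigma_3 + a_1\sigma_1)^2 - a_2^2\sigma_2^2\right\}\left\{a_1^2\sigma_1^2 - (a_2\sigma_2 - a_3\sigma_3)^2\right\}\left\{(a_2\sigma_2 + a_3\sigma_3)^2 - a_1^2\sigma_1^2\right\}} \\
&= (a_2\sigma_2 + a_3\sigma_3 - a_1\sigma_1)(a_2\sigma_2 - a_3\sigma_3 + a_1\sigma_1)(a_3\sigma_3 + a_1\sigma_1 + a_2\sigma_2)(a_3\sigma_3 + a_1\sigma_1 - a_2\sigma_2),
\end{align*}

each of the four factors occurring twice under the radical sign.

The second factor of the numerator is the negative of the first factor
of the denominator, and the remaining factors are the same in both.

Hence $r_{12.3} = -1$.

Similarly the other partial correlation coefficients can be proved
equal to $-1$.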

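7. If $x_1$, $x_2$ and $x_3$ are three variates measured from their
respective means as origin and if $e_1$ is the expected value of $x_1$ for
given values of $x_2$ and $x_3$ from the linear regression of $x_1$ on $x_2$ and
$x_3$, prove that

\[ \operatorname{cov}(x_1, e_1) = \operatorname{var}(e_1) = \operatorname{var}(x_1) - \operatorname{var}(x_1 - e_1). \]

(I. S. I. Calcutta 1956)

Since $e_1 + x_{1.23} = x_1$, we have

\begin{align*}
\operatorname{cov}(x_1, e_1) &= \operatorname{cov}(x_1,\, x_1 - x_{1.23}) \\
&= E\,\{x_1\,(x_1 - x_{1.23})\} \\
&= \frac{1}{N}\sum x_1\,(x_1 - x_{1.23}) \\
&= \frac{1}{N}\sum x_1^2 - \frac{1}{N}\sum x_1x_{1.23} \\
&= \frac{1}{N}\sum x_1^2 - \frac{1}{N}\sum (e_1 + x_{1.23})\,x_{1.23} \\
&= \frac{1}{N}\sum x_1^2 - \frac{1}{N}\sum e_1x_{1.23} - \frac{1}{N}\sum x_{1.23}^2 \\
&= \frac{1}{N}\sum x_1^2 - \frac{1}{N}\sum x_{1.23}^2 \qquad \left[\because\ \textstyle\sum e_1x_{1.23} = 0 \text{ from Q. 2 (ii)}\right] \\
&= \operatorname{var}(x_1) - \operatorname{var}(x_{1.23}) \\
&= \operatorname{var}(x_1) - \operatorname{var}(x_1 - e_1). \qquad \ldots(1)
\end{align*}

And

\begin{align*}
\operatorname{var}(e_1) &= \frac{1}{N}\sum (x_1 - x_{1.23})^2 \\
&= \frac{1}{N}\sum x_1^2 + \frac{1}{N}\sum x_{1.23}^2 - \frac{2}{N}\sum x_1x_{1.23} \\
&= \frac{1}{N}\sum x_1^2 + \frac{1}{N}\sum x_{1.23}^2 - \frac{2}{N}\sum (e_1 + x_{1.23})\,x_{1.23} \\
&= \frac{1}{N}\sum x_1^2 - \frac{1}{N}\sum x_{1.23}^2 \qquad \left[\because\ \textstyle\sum e_1x_{1.23} = 0 \text{ as before}\right] \\
&= \operatorname{var}(x_1) - \operatorname{var}(x_{1.23}) \\
&= \operatorname{var}(x_1) - \operatorname{var}(x_1 - e_1). \qquad \ldots(2)
\end{align*}

Hence from (1) and (2), it follows that

\[ \operatorname{cov}(x_1, e_1) = \operatorname{var}(e_1) = \operatorname{var}(x_1) - \operatorname{var}(x_1 - e_1). \]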

8. (i) If $r_{12}$ and $r_{13}$ are given, show that $r_{23}$ must lie in the
range $r_{12}r_{13} \pm (1 - r_{12}^2 - r_{13}^2 + r_{12}^2r_{13}^2)^{1/2}$.

(Delhi M. A. '54, Calcutta B. A. Hons. '54, Delhi I. C. A. R. '54)

(ii) If $r_{12} = k$, $r_{23} = -k$, show that $r_{13}$ will lie between $-1$ and
$1 - 2k^2$. (Calcutta I. S. I. '53)

(i) Since $r^2_{12.3} \le 1$, we get

\begin{align*}
& (r_{12} - r_{13}r_{23})^2 \le (1 - r_{13}^2)(1 - r_{23}^2) \\
\text{or}\quad & r_{12}^2 + r_{13}^2r_{23}^2 - 2r_{12}r_{13}r_{23} \le 1 - r_{13}^2 - r_{23}^2 + r_{13}^2r_{23}^2 \\
\text{or}\quad & r_{23}^2 - 2r_{12}r_{13}r_{23} + r_{12}^2 + r_{13}^2 - 1 \le 0.
\end{align*}

Hence $r_{23}$ lies between the roots of the corresponding quadratic
equation, i.e. between

\[ \frac{2r_{12}r_{13} \pm \sqrt{4r_{12}^2r_{13}^2 - 4r_{12}^2 - 4r_{13}^2 + 4}}{2} = r_{12}r_{13} \pm (1 - r_{12}^2 - r_{13}^2 + r_{12}^2r_{13}^2)^{1/2}. \]

(ii) Proceeding as in (i), we get

\[ r_{13}^2 - 2r_{12}r_{13}r_{23} + r_{23}^2 + r_{12}^2 - 1 \le 0. \]

Putting $r_{12} = k$, $r_{23} = -k$, we get

\[ r_{13}^2 + 2k^2r_{13} + 2k^2 - 1 \le 0, \]

so that $r_{13}$ lies between the roots

\[ \frac{-2k^2 \pm \sqrt{4k^4 - 8k^2 + 4}}{2} = -k^2 \pm (k^2 - 1), \]

i.e. $r_{13}$ lies between $-1$ and $1 - 2k^2$.
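
The bounds of this example are easy to tabulate. The sketch below is an illustration only (the function name is hypothetical); it uses the factorisation $1 - r_{12}^2 - r_{13}^2 + r_{12}^2r_{13}^2 = (1 - r_{12}^2)(1 - r_{13}^2)$, and by symmetry the same function gives the range of whichever correlation is left free when the other two are supplied.

```python
import math

def third_correlation_range(r_a, r_b):
    """Admissible range of the remaining correlation, given the other two."""
    centre = r_a * r_b
    half_width = math.sqrt((1 - r_a**2) * (1 - r_b**2))
    return centre - half_width, centre + half_width

# Part (ii) with the illustrative choice k = 0.5: r12 = k, r23 = -k.
k = 0.5
low, high = third_correlation_range(k, -k)
print(low, high)  # should agree with -1 and 1 - 2*k**2
```

With $k = 0.5$ the bounds come out as $-1$ and $0.5 = 1 - 2k^2$, in agreement with part (ii).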



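9. A number of persons are measured for their heights (x),
weights (y) and chest expansions (z) and product moment correlation
coefficients are calculated. Prove that

\[ r_{xy} + r_{yz} + r_{zx} \ge -\tfrac{3}{2}. \]

(Delhi M. A. 1956)

Let $x$, $y$, $z$ be measured from their respective means. Since

\[ E\left[ \left( \frac{x}{\sigma_x} + \frac{y}{\sigma_y} + \frac{z}{\sigma_z} \right)^2 \right] \ge 0, \]

we get

\[ E\left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} + \frac{z^2}{\sigma_z^2} + \frac{2xy}{\sigma_x\sigma_y} + \frac{2yz}{\sigma_y\sigma_z} + \frac{2zx}{\sigma_z\sigma_x} \right) \ge 0. \qquad \ldots(1) \]

Now

\[ E\left( \frac{x^2}{\sigma_x^2} \right) = \frac{\sigma_x^2}{\sigma_x^2} = 1. \]

Similarly

\[ E\left( \frac{y^2}{\sigma_y^2} \right) = E\left( \frac{z^2}{\sigma_z^2} \right) = 1. \]

Also

\[ E\left( \frac{2xy}{\sigma_x\sigma_y} \right) = \frac{2\,r_{xy}\sigma_x\sigma_y}{\sigma_x\sigma_y} = 2r_{xy}. \]

Similarly

\[ E\left( \frac{2yz}{\sigma_y\sigma_z} \right) = 2r_{yz} \quad \text{and} \quad E\left( \frac{2zx}{\sigma_z\sigma_x} \right) = 2r_{zx}. \]

Hence (1) gives

\[ 1 + 1 + 1 + 2r_{xy} + 2r_{yz} + 2r_{zx} \ge 0 \]

\[ \text{or}\quad r_{xy} + r_{yz} + r_{zx} \ge -\tfrac{3}{2}. \]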

10. If $x_1$, $x_2$ and $x_3$ are three variates measured from their
respective means as origin and of equal variances, find the coefficient
of correlation between $(x_1 + x_3)$ and $(x_2 + x_3)$ in terms of $r_{12}$,
$r_{13}$ and $r_{23}$ and show that it is equal to

(i) $\dfrac{1 + r_{12}}{2}$ if $r_{13} = r_{23} = 0$, or (ii) $\dfrac{3 + r_{12}}{4}$ if $r_{13} = r_{23} = 1$.

(Calcutta I. S. I. '56)

We have

\[ r_{(x_1+x_3)(x_2+x_3)} = \frac{E\,(x_1+x_3)(x_2+x_3)}{\sqrt{\operatorname{var}(x_1+x_3)}\,\sqrt{\operatorname{var}(x_2+x_3)}}. \]

Now

\begin{align*}
E\,(x_1+x_3)(x_2+x_3) &= E\,(x_1x_2 + x_1x_3 + x_2x_3 + x_3^2) \\
&= r_{12}\sigma_1\sigma_2 + r_{13}\sigma_1\sigma_3 + r_{23}\sigma_2\sigma_3 + \sigma_3^2 \\
&= \sigma_1^2\,(r_{12} + r_{13} + r_{23} + 1) \qquad \text{since } \sigma_1 = \sigma_2 = \sigma_3.
\end{align*}

Also

\begin{align*}
\operatorname{var}(x_1+x_3) &= \operatorname{var}(x_1) + \operatorname{var}(x_3) + 2\operatorname{cov}(x_1, x_3) \\
&= \sigma_1^2 + \sigma_3^2 + 2\sigma_1\sigma_3r_{13} \\
&= 2\sigma_1^2\,(1 + r_{13}).
\end{align*}

Similarly

\[ \operatorname{var}(x_2+x_3) = \sigma_2^2 + \sigma_3^2 + 2\sigma_2\sigma_3r_{23} = 2\sigma_1^2\,(1 + r_{23}). \]

Hence

\[ r_{(x_1+x_3)(x_2+x_3)} = \frac{\sigma_1^2\,(1 + r_{12} + r_{13} + r_{23})}{2\sigma_1^2\,\{(1 + r_{13})(1 + r_{23})\}^{1/2}} = \frac{1 + r_{12} + r_{13} + r_{23}}{2\,\{(1 + r_{13})(1 + r_{23})\}^{1/2}}. \]

(i) If $r_{13} = r_{23} = 0$, we get

\[ r_{(x_1+x_3)(x_2+x_3)} = \frac{1 + r_{12}}{2}, \]

and (ii) if $r_{13} = r_{23} = 1$, we get

\[ r_{(x_1+x_3)(x_2+x_3)} = \frac{3 + r_{12}}{4}. \]

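11. The three variates $x_1$, $x_2$, $x_3$ are measured from their
means. $\sigma_1 = 1$; $\sigma_2 = 1.3$; $\sigma_3 = 1.9$; $r_{12} = 0.370$; $r_{13} = -0.641$;
$r_{23} = -0.736$. Calculate $r_{13.2}$. If $x_4 = x_1 + x_2$, obtain $r_{42}$, $r_{43}$ and
$r_{43.2}$. Verify that the two partial correlation coefficients are equal,
and explain the result.

We have

\begin{align*}
r_{13.2} &= \frac{r_{13} - r_{12}r_{23}}{\left[(1 - r_{12}^2)(1 - r_{23}^2)\right]^{1/2}} \\
&= \frac{-0.641 - (0.370)(-0.736)}{\left[\{1 - (0.370)^2\}\{1 - (-0.736)^2\}\right]^{1/2}} \\
&= \frac{-0.641 + 0.27232}{\{(0.8631)(0.4583)\}^{1/2}} \\
&= \frac{-0.36868}{0.6289} \\
&= -0.586.
\end{align*}

Again since $x_4 = x_1 + x_2$ and $x_1$, $x_2$ are correlated, we have

\begin{align*}
\sigma_4^2 &= \sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2 \\
&= 1 + 1.69 + 2\,(1)(1.3)(0.370) \\
&= 3.652
\end{align*}

or $\sigma_4 = 1.91$.

Also

\begin{align*}
\operatorname{cov}(x_4, x_2) &= \operatorname{cov}(x_1 + x_2,\, x_2) \\
&= E\,\{(x_1 + x_2)\,x_2\} \\
&= E\,(x_1x_2) + E\,(x_2^2) \\
&= r_{12}\sigma_1\sigma_2 + \sigma_2^2 \\
&= \sigma_2\,(r_{12}\sigma_1 + \sigma_2) = 1.3\,[0.370 + 1.3] \\
&= 2.171.
\end{align*}

Hence

\[ r_{42} = \frac{\operatorname{cov}(x_4, x_2)}{\sigma_4\sigma_2} = \frac{2.171}{(1.91)(1.3)} = \frac{2.171}{2.483} = 0.874. \]

Again

\begin{align*}
\operatorname{cov}(x_4, x_3) &= E\,\{(x_1 + x_2)\,x_3\} \\
&= E\,(x_1x_3) + E\,(x_2x_3) \\
&= r_{13}\sigma_1\sigma_3 + r_{23}\sigma_2\sigma_3 \\
&= -0.641\,(1)(1.9) + (-0.736)(1.3)(1.9) \\
&= -1.2179 - 1.81792 \\
&= -3.03582.
\end{align*}

\[ \therefore\quad r_{43} = \frac{\operatorname{cov}(x_4, x_3)}{\sigma_4\sigma_3} = \frac{-3.03582}{(1.91)(1.9)} = \frac{-3.03582}{3.629} = -0.836. \]

Hence

\begin{align*}
r_{43.2} &= \frac{r_{43} - r_{42}r_{23}}{\{(1 - r_{42}^2)(1 - r_{23}^2)\}^{1/2}} \\
&= \frac{-0.836 - 0.874\,(-0.736)}{\left[\{1 - (0.874)^2\}\{1 - (-0.736)^2\}\right]^{1/2}} \\
&= \frac{-0.836 + 0.643264}{\{(0.2361)(0.4583)\}^{1/2}} \\
&= \frac{-0.192736}{(0.10820463)^{1/2}} = \frac{-0.192736}{0.329} \\
&= -0.586, \text{ which is the same as } r_{13.2}.
\end{align*}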
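
The equality found above is exact and not an accident of rounding. Since $x_4 = x_1 + x_2$, the regression coefficient of $x_4$ on $x_2$ is $b_{12} + 1$, so the residual $x_4 - (b_{12} + 1)x_2$ is the same as $x_1 - b_{12}x_2$; the two partial correlations are therefore correlations of the same pair of residuals. The sketch below, a hedged illustration in which the helper name `partial` is an assumption, repeats the arithmetic without intermediate rounding.

```python
import math

def partial(r_ab, r_ac, r_bc):
    """Partial correlation of a and b with c held fixed."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))

# Data of the example.
s1, s2, s3 = 1.0, 1.3, 1.9
r12, r13, r23 = 0.370, -0.641, -0.736

# Moments of x4 = x1 + x2.
s4 = math.sqrt(s1**2 + s2**2 + 2 * r12 * s1 * s2)
r42 = (r12 * s1 * s2 + s2**2) / (s4 * s2)
r43 = (r13 * s1 * s3 + r23 * s2 * s3) / (s4 * s3)

r13_2 = partial(r13, r12, r23)  # variates 1 and 3, with 2 held fixed
r43_2 = partial(r43, r42, r23)  # variates 4 and 3, with 2 held fixed
print(round(r13_2, 3), round(r43_2, 3))
```

Both values round to $-0.586$, as obtained above.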

12. The following means, standard deviations and correlations
are found for

$X_1$ = seed-hay crops in cwts. per acre,

$X_2$ = spring rainfall in inches,

$X_3$ = accumulated temperature above 42°F in spring,

in a certain district of England during twenty years.

$\bar{X}_1 = 28.02$, $\sigma_1 = 4.42$, $r_{12} = +0.80$,

$\bar{X}_2 = 4.91$, $\sigma_2 = 1.10$, $r_{13} = -0.40$,

$\bar{X}_3 = 594$, $\sigma_3 = 85$, $r_{23} = -0.56$.

Find the partial correlations and the regression equation for
hay-crop on spring rainfall and accumulated temperature.

(Lucknow B. Sc. '46)

We have

\begin{align*}
r_{12.3} &= \frac{r_{12} - r_{13}r_{23}}{\{(1 - r_{13}^2)(1 - r_{23}^2)\}^{1/2}} \\
&= \frac{0.80 - (-0.40)(-0.56)}{\left[\{1 - (-0.40)^2\}\{1 - (-0.56)^2\}\right]^{1/2}} \\
&= \frac{0.80 - 0.224}{\left[(0.84)(0.6864)\right]^{1/2}} = \frac{0.576}{0.759} \\
&= +0.759,
\end{align*}

\begin{align*}
r_{13.2} &= \frac{r_{13} - r_{23}r_{12}}{\{(1 - r_{23}^2)(1 - r_{12}^2)\}^{1/2}} \\
&= \frac{-0.40 - (-0.56)(0.80)}{\left[\{1 - (-0.56)^2\}\{1 - (0.80)^2\}\right]^{1/2}} \\
&= 0.097,
\end{align*}

\begin{align*}
r_{23.1} &= \frac{r_{23} - r_{12}r_{13}}{\{(1 - r_{12}^2)(1 - r_{13}^2)\}^{1/2}} \\
&= \frac{-0.56 - (-0.40)(+0.80)}{\left[\{1 - (-0.40)^2\}\{1 - (0.80)^2\}\right]^{1/2}} \\
&= -0.436.
\end{align*}

Again $\Delta_{11} = 1 - r_{23}^2 = 1 - (-0.56)^2 = 0.6864$,

$\Delta_{12} = r_{13}r_{23} - r_{12} = (-0.40)(-0.56) - 0.80 = -0.576$,

$\Delta_{13} = r_{12}r_{23} - r_{13} = (-0.56)(0.80) - (-0.40) = -0.048$.

Hence the regression equation for hay crop on spring rainfall
and accumulated temperature is

\begin{align*}
& \frac{\Delta_{11}}{\sigma_1}(X_1 - \bar{X}_1) + \frac{\Delta_{12}}{\sigma_2}(X_2 - \bar{X}_2) + \frac{\Delta_{13}}{\sigma_3}(X_3 - \bar{X}_3) = 0, \\
\text{i.e.}\quad & \frac{0.6864}{4.42}(X_1 - 28.02) + \frac{(-0.576)}{1.10}(X_2 - 4.91) + \frac{(-0.048)}{85.00}(X_3 - 594) = 0 \\
\text{or}\quad & X_1 - 28.02 = 3.37\,(X_2 - 4.91) + 0.00364\,(X_3 - 594) \\
\text{or}\quad & X_1 = 9.31 + 3.37X_2 + 0.00364X_3.
\end{align*}
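
The passage from the cofactors to the final regression equation can be retraced mechanically. The function below is a hypothetical helper written for this illustration, not a routine from any library; it solves the cofactor form of the plane for $X_1$.

```python
def regression_on_two(means, sds, r12, r13, r23):
    """Coefficients (a, b2, b3) of X1 = a + b2*X2 + b3*X3 from the cofactor form."""
    m1, m2, m3 = means
    s1, s2, s3 = sds
    d11 = 1 - r23**2        # cofactor of r11
    d12 = r13 * r23 - r12   # cofactor of r12
    d13 = r12 * r23 - r13   # cofactor of r13
    b2 = -(d12 / s2) / (d11 / s1)
    b3 = -(d13 / s3) / (d11 / s1)
    a = m1 - b2 * m2 - b3 * m3
    return a, b2, b3

# Data of the hay-crop example.
a, b2, b3 = regression_on_two((28.02, 4.91, 594), (4.42, 1.10, 85),
                              0.80, -0.40, -0.56)
print(round(a, 2), round(b2, 2), round(b3, 5))
```

Carried through without intermediate rounding, the constant term comes out a little below the 9.31 obtained above from the rounded coefficients; the difference is in the third significant figure only.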

13. On the basis of observations made on 35 cotton plants,
the total correlations of yield of cotton ($x_1$), number of bolls, i.e.
seed-vessels, ($x_2$) and height ($x_3$) are found to be

\[ r_{12} = 0.863, \quad r_{13} = 0.648 \quad \text{and} \quad r_{23} = 0.709. \]

Determine the multiple correlation $R_{1(23)}$ and the partial
correlations $r_{12.3}$ and $r_{13.2}$ and interpret your results.

We have

\begin{align*}
R_{1(23)} &= \left[ \frac{r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23}}{1 - r_{23}^2} \right]^{1/2} \\
&= \left[ \frac{(0.863)^2 + (0.648)^2 - 2\,(0.863)(0.648)(0.709)}{1 - (0.709)^2} \right]^{1/2} \\
&= 0.865, \text{ on calculation,}
\end{align*}

\begin{align*}
r_{12.3} &= \frac{r_{12} - r_{13}r_{23}}{\left[(1 - r_{13}^2)(1 - r_{23}^2)\right]^{1/2}} \\
&= \frac{0.863 - (0.648)(0.709)}{\left[\{1 - (0.648)^2\}\{1 - (0.709)^2\}\right]^{1/2}} \\
&= 0.751
\end{align*}

and

\begin{align*}
r_{13.2} &= \frac{r_{13} - r_{23}r_{12}}{\left[(1 - r_{23}^2)(1 - r_{12}^2)\right]^{1/2}} \\
&= \frac{0.648 - (0.709)(0.863)}{\left[\{1 - (0.709)^2\}\{1 - (0.863)^2\}\right]^{1/2}} \\
&= 0.101.
\end{align*}

Interpretation. Since $R_{1(23)}$ is very large, it follows that $x_2$
and $x_3$ have considerable influence on $x_1$. In other words, for
given values of $x_2$ and $x_3$, the predicted value of $x_1$ obtained from
the regression equation of $x_1$ on $x_2$ and $x_3$ will be excellent.

Again since the total correlation $r_{12}$ is quite large, it is
desirable to take $x_2$ as an independent variable for predicting $x_1$.
The partial correlation $r_{13.2}$, being equal to only 0.101, shows that the
inclusion of $x_3$ as an independent variable, in addition to $x_2$,
would not appreciably increase the accuracy of prediction (see the
Note to Ex. 3).
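
The interpretation can be made quantitative by comparing the fraction of the variance of $x_1$ left unexplained with and without $x_3$. The following sketch simply repeats the arithmetic of this example; the variable names are illustrative.

```python
import math

r12, r13, r23 = 0.863, 0.648, 0.709  # total correlations of the example

R_sq = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)
R = math.sqrt(R_sq)
r12_3 = (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))
r13_2 = (r13 - r12 * r23) / math.sqrt((1 - r12**2) * (1 - r23**2))

# Fraction of var(x1) left unexplained by x2 alone, and by x2 and x3 together.
unexplained_x2_only = 1 - r12**2
unexplained_x2_x3 = 1 - R_sq

print(round(R, 3), round(r12_3, 3), round(r13_2, 3))
print(round(unexplained_x2_only, 4), round(unexplained_x2_x3, 4))
```

The two unexplained fractions differ only in the third decimal place, which is the numerical meaning of a partial correlation $r_{13.2}$ as small as 0.101.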

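14. Is it possible to get the following from a set of
experimental data?

(a) $r_{23} = 0.8$, $r_{31} = -0.5$, $r_{12} = 0.6$.

(b) $r_{23} = 0.7$, $r_{31} = -0.4$, $r_{12} = 0.6$. (Agra B. Sc. '61)

(a) Here

\begin{align*}
r_{12.3} &= \frac{r_{12} - r_{13}r_{23}}{\{(1 - r_{13}^2)(1 - r_{23}^2)\}^{1/2}} \\
&= \frac{0.6 + 0.5 \times 0.8}{\{(1 - 0.25)(1 - 0.64)\}^{1/2}} \\
&= \frac{1}{\sqrt{0.75 \times 0.36}} \\
&= \frac{100}{\sqrt{75 \times 36}} = \frac{100}{30\sqrt{3}} = \frac{10\sqrt{3}}{9} > 1,
\end{align*}

which is not possible.

(b)

\begin{align*}
r_{12.3} &= \frac{0.6 + 0.4 \times 0.7}{\{(1 - 0.16)(1 - 0.49)\}^{1/2}} \\
&= \frac{0.88}{\sqrt{0.84 \times 0.51}} > 1,
\end{align*}

which again is impossible.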
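
The test used here can be put in a single condition. As Ex. 8 shows, $r^2_{12.3} \le 1$ is equivalent to $1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23} \ge 0$, i.e. $\Delta \ge 0$. The sketch below applies that condition; the function name and the small tolerance are assumptions made for illustration.

```python
def is_consistent(r12, r13, r23, tol=1e-12):
    """True if the three correlations could come from one trivariate data set."""
    delta = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    in_range = all(abs(r) <= 1 for r in (r12, r13, r23))
    return in_range and delta >= -tol

case_a = is_consistent(0.6, -0.5, 0.8)      # part (a) of this example
case_b = is_consistent(0.6, -0.4, 0.7)      # part (b) of this example
hay = is_consistent(0.80, -0.40, -0.56)     # data of Ex. 12, for comparison
print(case_a, case_b, hay)
```

Both sets of values in this example fail the condition, while the hay-crop correlations of Ex. 12 satisfy it.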

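15. If $x_1 = y_1 + y_2$, $x_2 = y_2 + y_3$, $x_3 = y_3 + y_1$, where $y_1$, $y_2$, $y_3$
are uncorrelated variables each of which has zero mean and unit
standard deviation, find the multiple correlation coefficient between
$x_1$ and the two variables $x_2$, $x_3$. (Delhi M. A. 1961)

We shall first of all find $r_{12}$, $r_{23}$ and $r_{31}$.

We have

\[ r_{12} = \frac{\operatorname{cov}(x_1, x_2)}{\sigma_1\sigma_2}. \]

Now

\begin{align*}
\operatorname{cov}(x_1, x_2) &= E\,(x_1x_2) \qquad [\because\ y_1, y_2, y_3 \text{ have zero means, } x_1, x_2, x_3 \text{ will also have zero means}] \\
&= E\,(y_1 + y_2)(y_2 + y_3) \\
&= E\,(y_1y_2) + E\,(y_1y_3) + E\,(y_2^2) + E\,(y_2y_3) \\
&= 1. \qquad [\because\ y_1, y_2, y_3 \text{ are uncorrelated, } E\,(y_1y_2) = E\,(y_2y_3) = E\,(y_3y_1) = 0. \text{ Also } E\,(y_2^2) = 1.]
\end{align*}

Again

\begin{align*}
\sigma_1^2 = E\,(x_1^2) &= E\,(y_1 + y_2)^2 \\
&= E\,(y_1^2) + 2E\,(y_1y_2) + E\,(y_2^2) \\
&= 1 + 0 + 1 = 2.
\end{align*}

Similarly, $\sigma_2^2 = 2$.

Hence $r_{12} = \tfrac{1}{2}$.

Similarly $r_{23} = r_{31} = \tfrac{1}{2}$.

\[ \therefore\quad \Delta = \begin{vmatrix} 1 & \tfrac12 & \tfrac12 \\ \tfrac12 & 1 & \tfrac12 \\ \tfrac12 & \tfrac12 & 1 \end{vmatrix} = \tfrac{1}{2} \text{ on simplification,} \]

\[ \text{and}\quad \Delta_{11} = \begin{vmatrix} 1 & \tfrac12 \\ \tfrac12 & 1 \end{vmatrix} = \tfrac{3}{4}. \]

Hence

\[ R_{1(23)} = \sqrt{1 - \frac{\Delta}{\Delta_{11}}} = \sqrt{1 - \frac{2}{3}} = \frac{1}{\sqrt{3}}. \]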
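
Because the $y$'s are uncorrelated with unit variance, the covariance of any two of the $x$'s is simply the sum of products of their coefficients on $y_1$, $y_2$, $y_3$. The sketch below uses this to recompute the result; the coefficient table `A` and the helper `r` are names introduced only for this illustration.

```python
import math

# Rows give the coefficients of (y1, y2, y3) in x1, x2, x3.
A = [[1, 1, 0],   # x1 = y1 + y2
     [0, 1, 1],   # x2 = y2 + y3
     [1, 0, 1]]   # x3 = y3 + y1

# cov(x_i, x_j) = sum of products of coefficients, since cov(y) is the identity.
cov = [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(3)]
       for i in range(3)]

def r(i, j):
    """Correlation between x_(i+1) and x_(j+1)."""
    return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])

r12, r13, r23 = r(0, 1), r(0, 2), r(1, 2)
R_sq = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)
print(r12, r13, r23, math.sqrt(R_sq))
```

Each correlation is $\tfrac12$ and $R^2_{1(23)} = \tfrac13$, in agreement with the value $1/\sqrt{3}$ found above.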

Exercises

1. Three variables have in pairs simple correlation coefficients
given by

$r_{12} = 0.8$, $r_{13} = -0.7$, $r_{23} = -0.9$.

Find the multiple correlation coefficient $R_{1(23)}$ of $x_1$ on $x_2$
and $x_3$. Ans. $R_{1(23)} = 0.80$.

2. Suppose a computer has found, for a given set of values
of $x_1$, $x_2$ and $x_3$,

$r_{12} = 0.91$, $r_{13} = 0.33$ and $r_{23} = 0.81$.

Examine whether his computations may be said to be free
from error. Ans. No.



3. In a study of the factors which influence "academic success",
May obtained the following results (among others) based on
the records of 450 students at Syracuse University:

$X_1$ = honor points, $X_2$ = general intelligence, $X_3$ = hours of study

$\bar{X}_1 = 18.5$, $\bar{X}_2 = 100.6$, $\bar{X}_3 = 24$

$\sigma_1 = 11.2$, $\sigma_2 = 15.8$, $\sigma_3 = 6$

$r_{12} = 0.60$, $r_{13} = 0.32$, $r_{23} = -0.35$.

Find to what extent honor points were related to general
intelligence, when hours of study (per week) are held
constant.

Also find the other partial coefficients.

[Ans. $r_{12.3} = 0.80$; $r_{13.2} = 0.71$, $r_{23.1} = -0.72$]

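4. Calculate the multiple correlation coefficient of $X_1$ on $X_2$
and $X_3$ from the following data :—






Find also the regression equation of $X_1$ on $X_2$ and $X_3$.

[Hint. $\bar{X}_1 = 3.714$, $\bar{X}_2 = 2.714$, $\bar{X}_3 = 19.86$,
$\sigma_1^2 = 4.489$, $\sigma_2^2 = 0.775$, $\sigma_3^2 = 32.694$,
$r_{12} = 0.758$, $r_{23} = 0.758$, $r_{31} = 0.927$,
$\Delta = 0.0584$, $\Delta_{11} = 0.4254$, $\Delta_{12} = -0.05533$,
$\Delta_{13} = -0.3524$.

\[ \therefore\quad R_{1(23)} = \left[1 - \frac{\Delta}{\Delta_{11}}\right]^{1/2} = \left[1 - \frac{0.0584}{0.4254}\right]^{1/2} = 0.93. \]

Regression equation of $X_1$ on $X_2$ and $X_3$ is

\[ \frac{\Delta_{11}}{\sigma_1}(X_1 - \bar{X}_1) + \frac{\Delta_{12}}{\sigma_2}(X_2 - \bar{X}_2) + \frac{\Delta_{13}}{\sigma_3}(X_3 - \bar{X}_3) = 0 \]

or $0.201\,(X_1 - 3.714) - 0.06287\,(X_2 - 2.714) - 0.0616\,(X_3 - 19.86) = 0$.]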

5. Calculate $R_{1(23)}$, $r_{12.3}$, $r_{13.2}$ for the following trivariate
distribution. Also find the regression of $X_2$ on $X_1$ and $X_3$:

$X_1$: 19 51 30 42 25 18 44 56 38 32 25 10 20 27 13 49 27 55

$X_2$: 8 15 11 21 7 5 10 13 12 13 5 6 4 8 7 12 6 16

$X_3$: 4 5 3 3 2 1 4 6 3 4 2 3 4 4 3 5 3 7

[Hint. $\bar{X}_1 = 32.28$, $\bar{X}_2 = 9.94$, $\bar{X}_3 = 3.67$,
$\sigma_1 = 14.02$, $\sigma_2 = 4.43$, $\sigma_3 = 1.41$,
$r_{12} = 0.768$, $r_{13} = 0.719$, $r_{23} = 0.520$.

$R_{1(23)} = 0.854$,

$r_{12.3} = 0.663$, $r_{13.2} = 0.585$.

Regression equation of $X_1$ on $X_2$ and $X_3$ is

$X_1 = -17.55 + 2.58X_2 + 6.59X_3$.]

6. In a trivariate distribution it is found that $\sigma_1 = 3$, $\sigma_2 = 4$,
$\sigma_3 = 5$, $r_{23} = 0.40$, $r_{31} = 0.60$, $r_{12} = 0.70$. Prove that the partial
correlations are

$r_{23.1} = -0.035$, $r_{31.2} = 0.49$, $r_{12.3} = 0.63$.

(Bombay M. A. '58)

7. (a) Define a multiple correlation for a trivariate distribution.

In the usual notations, prove that

\[ R^2_{1(23)} = \frac{r_{12}^2 + r_{13}^2 - 2r_{12}r_{23}r_{31}}{1 - r_{23}^2}. \]

(b) Show that $R$ is always positive. Also interpret
$R_{1(23)} = 1$.

[Hint. For (a), see § 12.3 and for (b) first part see the last
paragraph of § 12.3. If $R_{1(23)} = 1$, then Q. 2 (iii) of § 12.5
gives $\operatorname{var}(x_{1.23}) = 0$, i.e. all the residuals $x_{1.23}$ are zero. It
means that the expected and observed values of $x_1$ coincide.]

8. Prove the identity

$b_{12.3}\,b_{23.1}\,b_{31.2} = r_{12.3}\,r_{23.1}\,r_{31.2}$. (Delhi B. A. Hons. '54)

9. Show that the correlation coefficient between the residuals
$x_{1.23}$ and $x_{2.13}$ is equal and opposite to that between $x_{1.3}$
and $x_{2.3}$. (Delhi M. A. '56, '59; Delhi I. C. A. R. '51)

10. Show that if $x_3 = ax_1 + bx_2$, the three partial correlations are
numerically equal to unity, $r_{13.2}$ having the sign of $a$, $r_{23.1}$
the sign of $b$ and $r_{12.3}$ the opposite sign of $a/b$.

(Delhi B. A. Hons. '53, Bombay M. A. '56)



CHAPTER XIII 

PRELIMINARY IDEAS ON SAMPLING 

13.1. In everyday life we often assess a population through
samples. Thus a trader thrusts a conical trowel into a bag of
wheat and assesses the quality of the wheat in the bag by looking
at the sample so drawn; a housewife tests a small quantity of rice
to see if it has been well cooked; and so on. The importance of
the theory of sampling lies in the fact that for a large population
it is neither practical nor necessary to collect data for each and
every member of the population. Thus, in order to obtain
information about the economic condition of the rural population of
U. P., it would require a huge establishment to collect data from
each and every individual in the villages and then to tabulate the
data and calculate the parameters from them. It would be quite
sufficient if a sample of villages is selected (with due precautions)
and information collected from it. The information so gathered
may be taken to represent the whole rural population of the state.
Most industrial concerns exercise some sort of quality control over
their manufactured goods before sending them to the market. If
each and every item had to be tested, in some cases no goods could
be sent to the market; e.g. in the case of match boxes, a test would
mean the burning of all the matches.

13.2. Universe or population may be defined as any collection
of objects or results of an operation. Thus in the above example,
the persons residing in the villages of U. P. form the universe
or population; similarly in some experiment, the data regarding
pressures of a gas may form a population. A universe may
contain a number of sub-universes, e.g. the universe of students
in a certain state may be divided into universes of students in
primary schools, secondary schools and colleges; similarly a
universe may be a part of another universe or universes, e.g. the
rural population of U. P. is a part of the whole population of
U. P., which itself is a part of the population of India, then of
Asia and so on.

By existent universe, we mean an aggregate of concrete
objects such as the students in a college or the population of a
city, while by hypothetical universe we mean all conceivable ways
in which an event may happen, e.g. all possible ways in which a
coin tossed an indefinite number of times may fall, or the number
of ways in which the alphabet can be arranged to form a
word.

In a finite universe, the number of objects is finite, e.g. the
students in a college. In an infinite universe, the number is
infinite, e.g. the pressures of a gas or the number of children
born in a race.


13.3. Sampling, Types of samples. The most important method
of drawing a sample is that of random sampling, which implies
that each member of the universe has the same chance of being
selected for the sample. Great care has to be taken to ensure
that the sample drawn is random. Generally, selections made
by human judgment, however carefully made, contain some
element of bias. In order to avoid individual bias, the
selection is made with the help of random number tables, or some
mechanical means like the drawing of a lottery, the throw of a die
or a roulette wheel etc.

13.4. Simple sampling is a particular case of random sampling
in which the chance of selection of a member of the universe
for the sample is independent of the previous selections. Thus if
the selection is made with the help of a die or the toss of a coin,
the sampling would be simple. However, if a pack of cards is to
be used, suppose that we draw a sample of 10 out of 50 objects
and to each card we assign a number corresponding to each
member of the universe; the probabilities of individual members
being selected in the successive draws would be
$\tfrac{1}{52}$, $\tfrac{1}{51}$, $\tfrac{1}{50}$, ... if the
card drawn is not put back in the pack after each draw, and
though the selection is random it is not simple. However, if
after each draw the card is replaced, the sampling is simple.
It may be noted that if the population is very large, the random
sampling would always be simple.
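
The distinction between drawing with and without replacement can be illustrated with the standard library. This is a sketch only: a pack of 52 numbered cards is assumed, the seed is arbitrary, and the variable names are illustrative.

```python
import random
from fractions import Fraction

rng = random.Random(42)          # fixed seed so the run can be repeated
pack = list(range(1, 53))        # 52 numbered cards (an assumed universe)

# Cards not put back: random but not simple sampling; no card can repeat.
without_replacement = rng.sample(pack, 10)

# Cards put back after each draw: simple sampling; repeats are possible.
with_replacement = rng.choices(pack, k=10)

# Chance that a particular remaining card is chosen at draws 1, 2, 3
# when the cards are not put back.
chances = [Fraction(1, 52 - i) for i in range(3)]
print(without_replacement, with_replacement, chances)
```

Without replacement the chance changes from draw to draw (1/52, 1/51, 1/50, ...); with replacement it stays at 1/52 throughout, which is what makes the sampling simple.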


13.5. Devices for Random Sampling. It has been experimentally
established that selection by human judgment is always
subject to individual bias. Suppose a teacher is asked to
select ten students from his college; he is more likely to select his
favourite students. In order to avoid this human factor, methods
have been devised to select random samples. Some of these
methods are given below.

13.6. Tippett's Numbers. L. H. C. Tippett constructed
random number tables which consist of 41600 digits taken from
British census reports to give 10400 four-figure numbers. It
has been found that these tables are fairly random, and they have
played an important role in the sampling technique. The numbers
may be taken according to the column, or row, or diagonally, on
any page of the table.

Similarly Fisher, Kendall and the Indian statistician
Mahalanobis have published different random number tables. Random
samples may also be obtained by the throw of a die, by the draw of a
lottery (mixing some numbered chits thoroughly and drawing lots) or by
some mechanical means. Thus random figures are obtained by
machines in the draw of Prize Bonds in our country.

13.7. Stratified Sampling. Random sampling is not always
the best method of assessing the population. Thus to estimate
the average income of the inhabitants of a city, it is necessary
that all sections of the society be included in our sample;
otherwise there is a likelihood that rich people or poor
people may dominate our sample. For this purpose, it would
be better first to divide the city into different strata, say according
to localities (slums, middle-class localities, bungalow areas,
business localities, etc.), and then to select individuals at random
from each of these localities. This would ensure that all sections
of the society are represented in the sample. The above sampling
technique is known as stratified sampling. The size of the group
drawn from each stratum should be proportionate to the relative
importance in the population of the stratum represented by that group.
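The proportionate rule above can be sketched in code. This is only an illustration: the function names, the stratum labels and the use of stratum size as the measure of "relative importance" are assumptions made here, not part of the text.

```python
import random

def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to stratum size."""
    total = sum(stratum_sizes.values())
    raw = {name: sample_size * size / total for name, size in stratum_sizes.items()}
    # Take the whole-number part first, then hand out the remaining
    # places to the strata with the largest fractional remainders.
    shares = {name: int(value) for name, value in raw.items()}
    leftover = sample_size - sum(shares.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - shares[name], reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares

def stratified_sample(strata, sample_size, seed=None):
    """Draw a random sample (without replacement) from each stratum."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    shares = proportional_allocation(sizes, sample_size)
    return {name: rng.sample(members, shares[name]) for name, members in strata.items()}

# Hypothetical city of 10,000 households divided into four localities.
sizes = {"slum": 5000, "middle-class": 3000, "bungalow": 1500, "business": 500}
print(proportional_allocation(sizes, 100))
```

Within each stratum the selection is still random; only the number taken from each stratum is fixed in advance.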

If the sampling is done according to the rules of probability, 
the errors that are likely to creep in can be estimated and here 
lies the importance of random sampling. It is in this method 
that the rules of probability are applicable so that the parameters 
of the sample may be used to estimate the parameters of the 
populations, the results of the two samples may be compared to 
test whether they have been drawn from the same universe and 
whether a hypothesis is to be rejected on the basis of the results 
of a sample. The sampling errors can always be reduced by 
increasing the sample size, but it means corresponding increase in 
costs and labour. 




Example. Suggest a possible source of bias in the following :— 

(i) The mean income per family in a certain town is sought
to be estimated by sampling from motor owners. (Agra B. Sc. ’59)

(ii) Readers of a newspaper are sampled by printing in it an
invitation to them to send up their observations on some typical
event. (Agra B. Sc. ’59, Travancore ’45)

(iii) A sample household survey which includes several
houses in which none is at home when the investigator first calls.

(iv) A sample of the unemployed obtained by taking every 
hundredth name in the register of applicants for appointments 
arranged in alphabetical order. 

(v) A survey of incomes of the residents of a locality by 
interviewing the owner of every tenth house. 

(vi) A barrel of apples is sampled by taking a handful from 
the top. 


(vii) A mixture of sand and sawdust is sampled by scooping 
up a quantity from the bottom. 

(viii) A set of digits is taken by opening a telephone directory 
at random and choosing the telephone numbers in the order in which 
they appear on the page. 

(i) Motor car licenses are owned by only very rich people 
and thus the sample would represent only the wealthy class of the 
town. The sample is thus very much biassed. 

(ii) The sampling is not unbiassed, since those readers who are
interested in the topic are more likely to write on it. Some
of the readers are generally lethargic in writing letters and they
may not be represented in the sample; moreover, some others may
have missed the newspaper on the date of publication of the
invitation and thus be unable to send their views.

(iii) The surveyor may have visited during working hours, so that
the families in which both husband and wife are employed are
absent and they would not be represented in the sample. Thus
the sample is not unbiassed.


(iv) The sample is fairly unbiassed, since the alphabetic arrangement
of names is independent of the status of the individual. This
is known as systematic sampling, which is fairly random if the
characteristic considered is independent of the order.


(v) It is possible that the locality consists of ten-house
blocks, so that every tenth house is a corner house and may
belong to comparatively richer people, in which case the sample may be
biased.

(vi) Generally in transportation, the heaviest apples go down
to the bottom and the lighter and smaller ones remain at the top,
so that the apples at the top are not true representatives of the
whole barrel. Moreover, a shopkeeper is likely to put the best
apples at the top to attract the customers and it is possible that
other apples in the barrel may be of inferior quality.

(vii) The density of sand is greater than that of sawdust 
and hence when a sample is taken from the bottom, it is likely to 
consist of a higher proportion of sand than in the whole mixture. 

(viii) If we take a used directory, the pages frequently
used are more likely to be opened and thus the more popular
numbers are more apt to be included in the sample than others.



CHAPTER XIV 

SIMPLE SAMPLING OF ATTRIBUTES. 

LARGE SAMPLES. 

14.1. Populations and Samples. As already explained in the
previous chapter, in statistics, the word population is used to refer 
to any collection of objects or results of operations. For example, 
we may speak of the population of dairy cows in Meerut district, 
the population of mileages of automobile tyres, the population 
of prices of a commodity in a city. We may also speak of 
a hypothetical population of heads and tails obtained by tossing a
coin an infinite number of times, or the population of all possible
values which the bank rate can have in twenty years' time, and
so on.

The aim of a statistical enquiry is to find out something about 
some specified population. It is impossible or impracticable to 
examine each and every member of the population since such a 
process will be too costly in terms of time and money. Thus
the investigator is led to the study of a selected number of 
individuals from the population and on the basis of this limited 
investigation, he makes inference regarding the whole population. 
This selected number of individuals from a population is called 
a sample. The inferences that can be made from a sample about 
the whole population can never be of a categorical certainty. 
They can only be expressed in terms of probabilities. In order 
that the theory of probability can be applied, the sampling should 
be random. In the case of a non-random sample, there is no way 
to measure the degree of confidence to be placed in any inference 
which can be made from such a sample. The selection of an 
individual from a population is random when each member of the 
population has the same chance of being selected. The aims of
the theory of sampling are (1) to find estimates of certain constants
such as the mean and standard deviation of the population and
(2) to determine what degree of confidence can be placed in
these estimates when they are obtained; in other words, to determine
the limits within which the parameters of the population
are expected to lie with a specified degree of confidence.





14.2. Simple Sampling of Attributes. In sampling of attributes,
we are concerned only with the presence or absence of an
attribute. More specifically, the sampling of attributes may be
thought of as drawing samples from a population containing A's
and not-A's. The presence of the attribute A may be called a
success and its absence a failure. For example, in sampling a population
of tosses of a coin for the proportions of heads and tails,
we might speak of a sample of one thousand tosses, seven hundred
of which were heads. In other words the sample consisted of one
thousand events, seven hundred of which were successes and three
hundred failures.

By simple sampling we mean random sampling in which
each event has the same chance p of success and in which the
chances of success of different events are independent, whether
previous trials have been made or not. Thus in the throwing of
a die, the chance of getting an ace is not affected by what was
obtained on the previous trials and remains the same (i.e. 1/6) for
subsequent trials, provided, of course, the die does not wear out
and is not deliberately manipulated to get some other number by
the experimenter.

It should be noted that random sampling is not necessarily
simple but simple sampling is always random, as is evident from
the above definitions. As a matter of fact, simple sampling is
a particular form of random sampling. For example, if a bag
contains 5 black balls and 3 white balls, the chance of drawing a
black ball at the first trial is 5/8, and if a black ball is drawn and
is not replaced, the chance of drawing a black ball at the second
trial is 4/7, which is not the same as before. Hence the sampling is
not simple. However, the sampling is random, since on the first trial
each ball has the same chance 1/8 of being drawn out and at
the second trial each remaining ball has the same chance 1/7 of being
selected.
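The bag illustration can be checked with exact fractions. The sketch below is only illustrative; the function name and the convention that every earlier ball drawn was black are assumptions made for the example.

```python
from fractions import Fraction

def black_probabilities(black, white, draws, replace):
    """Chance of a black ball at each draw, supposing every earlier
    ball drawn was black."""
    probs = []
    for _ in range(draws):
        probs.append(Fraction(black, black + white))
        if not replace:
            black -= 1  # one black ball fewer is left in the bag
    return probs

# With replacement the chance stays constant (simple sampling);
# without replacement it changes from draw to draw (random, not simple).
print(black_probabilities(5, 3, 2, replace=True))
print(black_probabilities(5, 3, 2, replace=False))
```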

14.3. Mean and standard deviation in simple sampling of
attributes.

Suppose we draw N simple samples of size n from a large
population. Let each individual in a sample have a chance p of
success (i.e. selection) and a chance q = 1 - p of failure. Then
the probabilities of 0, 1, 2, ..., n successes are given by the
successive terms of

$(q + p)^n$, where $p + q = 1$.



This means that the probabilities of 0, 1, 2, 3, ..., n items in
the sample possessing the attribute under study are $q^n$, ${}^nC_1 q^{n-1} p$,
${}^nC_2 q^{n-2} p^2$, ..., $p^n$. We know that the mean of this distribution
is $np$ and the standard deviation is $\sqrt{npq}$. Hence the expected
number of successes in a sample of size n is $np$ and the standard
error of the number of successes in a sample of size n is $\sqrt{npq}$.

If instead of the number of successes in each sample, we
take the proportion of successes, the mean proportion of successes
will be $np/n = p$ and the standard error of the proportion of
successes is $\sqrt{npq}/n = \sqrt{pq/n}$.


Note. If p becomes very small, then q is nearly 1 and
$pq = p(1 - p) = p$ approx.

Hence $\sigma = \sqrt{np} = \sqrt{M}$, where M = np is the mean number of
successes. It follows that if the proportion
of successes be small, the standard error of the number of
successes is the square root of the mean number of successes.
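As a numerical illustration of these formulas, the following sketch computes the mean and standard error of the number of successes and the standard error of the proportion. The function names and the sample figures are chosen here for illustration only.

```python
from math import sqrt

def count_mean_and_se(n, p):
    """Mean np and standard error sqrt(npq) of the number of successes."""
    q = 1 - p
    return n * p, sqrt(n * p * q)

def proportion_se(n, p):
    """Standard error sqrt(pq/n) of the proportion of successes."""
    return sqrt(p * (1 - p) / n)

mean, se = count_mean_and_se(100, 0.5)
print(mean, se, proportion_se(100, 0.5))

# When p is small, sqrt(npq) is close to the square root of the mean np.
mean_small, se_small = count_mean_and_se(10000, 0.001)
print(se_small, sqrt(mean_small))
```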

14.4. Tests of significance for large samples.

We know that if n is large, the binomial distribution tends
to the normal, so that in the case of large samples the properties of
the normal curve can be used. Suppose we wish to test the hypothesis
that a given large sample of size n is obtained by simple
sampling from a population for which the probability of success
is p. For the normal distribution, we know that 99.7% of its
members lie within a range $\pm 3\sigma$, i.e. $\pm 3\sqrt{npq}$, on either side
of the mean $np$, so that only 0.3% of the members lie outside this
range.


Again only about 5% of the members of a normal population lie
outside the range mean $\pm 2\sigma$, i.e. $np \pm 2\sqrt{npq}$. Hence we have
the following test of significance for large samples :—

If the number of successes in a large sample of size n differs
from the expected value $np$ by more than $3\sqrt{npq}$, we call this
difference highly significant. Sometimes a difference of more
than $2\sqrt{npq}$ may be called significant whereas a difference of
less than $2\sqrt{npq}$ is called insignificant.

We may put the above test in a mathematical form as 
follows :— 

If z is the standard normal variate, we have

$$z = \frac{x - np}{\sqrt{npq}},$$

where x is the observed number of successes in the sample.
Thus

(i) If | z | > 3, the difference between the observed and the
expected number of successes is highly significant.

(ii) If 2 < | z | < 3, the difference may be regarded as
significant.

(iii) If | z | < 2, the difference is not significant.
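The three-way rule can be put as a small routine. This is a sketch only: the function names are invented here, and the treatment of the exact boundary values | z | = 2 and | z | = 3, which the rule leaves open, is an assumption.

```python
from math import sqrt

def z_score(x, n, p):
    """Standardised deviation of x observed successes from the expected np."""
    return (x - n * p) / sqrt(n * p * (1 - p))

def classify(z):
    """Apply the rule: above 3 highly significant, between 2 and 3 significant."""
    size = abs(z)
    if size > 3:
        return "highly significant"
    if size > 2:  # assumption: exactly 3 falls here, exactly 2 falls below
        return "significant"
    return "not significant"

# Illustrative figures: 216 heads in 400 tosses, and 66 heads in 100 tosses.
print(classify(z_score(216, 400, 0.5)))
print(classify(z_score(66, 100, 0.5)))
```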

14.5. Standard error. The standard deviation of simple
sampling is called the standard error. The use of the word
‘error’ is justified on the ground that we usually regard the expected
value as the true value, and divergences from it as errors of
estimation due to fluctuations of sampling. But too much
importance must not be attached to the word “errors”. Mostly
the term “standard error” is applied to the standard
deviation of simple sampling. Apart from this, the term has
a wider meaning which we shall discuss when we deal with
the theory of sampling of variables.

Probable error. The term “probable error” is used by some
authorities instead of “standard error”. The probable error is
0.67449 times the standard error. The reason for the use of
the term “probable error” lies in the fact that in the normal curve,
the quartiles are distant $0.67449\sigma$ from the mean, so that the
probability that a deviation is in excess of the probable error is
1/2 and is equal to the probability of a deviation being less than
the probable error.

If P denotes the probable error, we have

$P = 0.67449\sigma$,

so that $3\sigma = \frac{3P}{0.67449} = 4.5P$ approx.

Hence the rule that the observed deviation should not be
greater than 3 times the standard error is roughly equivalent
to a rule that it should not exceed 4.5 times the probable
error.

The probable error is often used as a measure of dispersion,
instead of standard error, since it has the merit of being easily
understood; but the term itself is misleading, for it is not actually
an error and so its use is being discarded in favour of the
standard error. The following figure is helpful for understanding
the nature of probable error and the relationship between
probable error and standard error. Half of the observations lie
between Mean + P. E. and Mean - P. E. and hence the chance
that an observation taken at random will lie between these limits
is equal to its chance of falling outside them.
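The constant 0.67449 and the "half inside, half outside" property can be verified numerically. The sketch below uses the standard normal distribution from Python's standard library; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()              # mean 0, standard deviation 1
pe_factor = std_normal.inv_cdf(0.75)   # upper quartile, about 0.67449

# Chance that an observation lies within one probable error of the mean.
inside = std_normal.cdf(pe_factor) - std_normal.cdf(-pe_factor)

print(round(pe_factor, 5))      # probable error as a multiple of sigma
print(round(inside, 3))         # one half
print(round(3 / pe_factor, 2))  # three standard errors in probable errors
```

The last figure is nearer 4.45 than 4.5, so the "4.5 times the probable error" rule is a slight rounding.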

14.6. Precision. We have seen above that the standard
error measures the unreliability of the value of p. Fluctuations of
the observed proportion will increase with an increase in the standard
error. The reciprocal of the standard error is called precision and
measures the reliability of the observed proportion. Since the
standard error varies inversely as the square root of the number
of observations in the sample, the precision will vary as the
square root of the number of observations. Thus to double the
precision (or halve the standard error), we should increase the
number of observations four times.
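A short check of the square-root rule, taking precision as the reciprocal of the standard error of a proportion. The function name and the figures are illustrative assumptions.

```python
from math import sqrt

def precision(n, p):
    """Reciprocal of the standard error sqrt(pq/n) of a proportion."""
    return 1 / sqrt(p * (1 - p) / n)

# Quadrupling the number of observations doubles the precision.
ratio = precision(400, 0.2) / precision(100, 0.2)
print(ratio)
```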

14.7. Conditions for simple sampling. The use of the formula
$\sqrt{npq}$ [or $\sqrt{pq/n}$ for the standard error of p] is justified
only under certain conditions which give rise to simple sampling
in practice. There are three such conditions, given below:

(1) The probability p of drawing an individual with attribute
A on random sampling must remain constant and be the same
for all samples. This means that the proportion of individuals
with attribute A in the population remains constant at the drawing
of each sample. Thus the theory of simple sampling cannot
apply to the variations of the death rate in localities having
populations of different ages and sexes, or to death rates in
successive years during a period of improved sanitary conditions.

(2) The proportion of individuals with attribute A must
remain constant at the drawing of each individual member of the
sample. Thus in the case of death rates mentioned above, the
samples must not only be of the same age and sex composition
and be under the same sanitary conditions, but each sample
should contain persons of one age and one sex only. For if in
each sample, there were persons of both sexes and different ages,
the condition would be violated, the probability of death during
a given period being different for the two sexes, or for different
age-groups.

It should be noted that the condition for the constancy of p 
will be satisfied if 

(i) the individuals are replaced after each drawing, before 
making the next drawing; by such a device, the constitution of 
the population does not change so that the chance of success 
remains the same; 

(ii) the population is infinite; in this case, the removal of 
a finite number of individuals does not affect the proportion of 
individuals in the population possessing a specified attribute; 

(iii) the population is very large so that p may be taken to
be constant without any appreciable error, provided the sample
is not also large. This is a very important case for the application
of the theory of simple sampling to many practical data.

(3) The third condition for simple sampling is that the
individual events must be completely independent of one another,
like the throws of a coin. For example, if we were dealing with
deaths from an infectious or contagious disease, the theory of
simple sampling could not be applied even if the
sample population consisted of persons of one age and one sex
only. For if one person in a certain sample has contracted the
disease in question, he has increased the chance of other
persons dying from the same disease.

14.8. Solved Examples.

1. In some dice-throwing experiments, Weldon threw dice
49,152 times, and of these 25,145 yielded a 4, 5 or 6. Is this
consistent with the hypothesis that the dice were unbiased ?

The total number of throws = 49,152.

The chance of throwing a 4, 5 or 6 with one die = 1/2.

The expected value of the number of successes

= 1/2 × 49152 = 24576

and the observed value of successes = 25145.

Thus the excess of the observed value over the expected
value = 25145 - 24576 = 569.

The standard deviation of simple sampling

$= \sqrt{npq} = \sqrt{49152 \times \frac{1}{2} \times \frac{1}{2}} = 110.9$.

Hence $z = \frac{x - np}{\sqrt{npq}} = \frac{569}{110.9} = 5.13$.

Since the observed deviation is 5.13 times the standard
error, it is highly improbable that it arose as a sampling
fluctuation. We must therefore seek some other reason
for this deviation. Hence it seems reasonable to suspect that the
dice were biased.
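The arithmetic of this example can be reproduced as follows. The two-sided tail probability at the end is an addition made here, computed from the normal approximation; it is not part of the worked solution.

```python
from math import sqrt
from statistics import NormalDist

n, x, p = 49152, 25145, 0.5       # throws, observed successes, chance of 4, 5 or 6

expected = n * p                  # expected number of successes
se = sqrt(n * p * (1 - p))        # standard deviation of simple sampling
z = (x - expected) / se

# Chance of a deviation at least this large in either direction.
two_sided = 2 * NormalDist().cdf(-abs(z))

print(expected, round(se, 1), round(z, 2))
print(two_sided)
```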

2. A sample of 900 days is taken from meteorological records
of a certain district, and 100 of them are found to be foggy. What
are the probable limits to the percentage of foggy days in the
district ? (Agra 1958, U. P. P. C. S. ’52)

Here p = 100/900 = 1/9 and q = 8/9.

∴ the standard error of the proportion of foggy days in the
district

$= \sqrt{pq/n} = \sqrt{\frac{1}{9} \times \frac{8}{9} \times \frac{1}{900}} = 0.0105$

= 1.05 per cent.

Hence, taking 1/9 to be the estimate of the proportion of foggy
days, we have that the limits to the percentage of foggy days in
the district are (100/9 ± 3 × 1.05) per cent, i.e. 8 per cent and 14.25
per cent approximately.

3. Certain crosses of pea gave 5321 yellow and 1804 green
seeds. The expectation is 25 per cent of green seeds on a Mendelian
hypothesis. Can the divergence from the expected value have
arisen from fluctuations of simple sampling only ?

The total number of pea seeds examined

= 5321 + 1804 = 7125.

The expected value of green seeds

= 25% of the total
= 1/4 × 7125 = 1781 approx.

And the observed result = 1804 green seeds.

The difference between the observed and the expected
values = 1804 - 1781 = 23.

Also the standard error of green seeds

$= \sqrt{npq} = \sqrt{7125 \times \frac{1}{4} \times \frac{3}{4}} = 36.6$.

Hence $z = \frac{23}{36.6} = 0.6$ approx.

Since the divergence of the observed from the expected value
is only 0.6 times the standard error, we may very well say that this
divergence is due to fluctuations of simple sampling.





4. A random sample of 500 pineapples was taken from a large
consignment and 65 were found to be bad. Estimate the proportion
of bad pineapples in the consignment, as well as the standard error
of the estimate. Deduce that the percentage of bad pineapples in
the consignment almost certainly lies between 8.5 and 17.5.

(I. A. S. ’54)

Here p = 65/500 = 13/100 and q = 87/100.

∴ the standard error of the proportion of bad pineapples
in the consignment $= \sqrt{\frac{13}{100} \times \frac{87}{100} \times \frac{1}{500}}$

= 0.015 = 1.5 per cent.

Hence taking 13/100 = 0.13 to be the estimate of the proportion of bad
pineapples in the consignment, we have that the limits to the percentage
of bad pineapples in the consignment are (13 ± 3 × 1.5) per cent, i.e.
8.5 per cent and 17.5 per cent approximately.

5. Show that for a random sample of size 100 drawn with replacement,
the standard error of a sample proportion cannot exceed
0.05.

Since p + q = 1 (i.e. constant), the product pq will be a maximum
when p = q = 1/2. Hence the maximum value of the S. E. of the

proportion $= \sqrt{pq/n} = \sqrt{\frac{1}{2} \times \frac{1}{2} \times \frac{1}{100}} = 0.05$.
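The claim that pq is greatest at p = 1/2 can be checked numerically over a grid of values of p. This is a sketch for illustration, not a proof; the grid spacing and names are chosen here.

```python
from math import sqrt

def proportion_se(p, n):
    """Standard error sqrt(pq/n) of a sample proportion."""
    return sqrt(p * (1 - p) / n)

# Try p = 0.000, 0.001, ..., 1.000 for samples of size 100.
grid = [k / 1000 for k in range(1001)]
worst_p = max(grid, key=lambda p: proportion_se(p, 100))

print(worst_p, proportion_se(worst_p, 100))
```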

6. 400 eggs are taken at random from a large consignment and
50 are found to be bad. Estimate the percentage of bad eggs in the
consignment and assign limits within which the percentage probably
lies.

Here p = 50/400 = 1/8 and q = 7/8.

∴ the standard error of the proportion of bad eggs in the
consignment $= \sqrt{\frac{1}{8} \times \frac{7}{8} \times \frac{1}{400}}$

= 0.016536 = 1.6536 per cent.

Hence taking 1/8 to be the estimate of the proportion of bad eggs, we
have that the limits to the percentage of bad eggs in the consignment
are 12.5 per cent ± 3 × 1.6536 per cent, i.e. 7.5 per cent and
17.5 per cent approximately.
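Limits of this kind (estimate ± 3 × standard error) recur in several examples, so a small helper is convenient. The function name is an assumption made here; the figures are those of the egg example.

```python
from math import sqrt

def three_sigma_limits(successes, n):
    """Lower and upper limits p ± 3*sqrt(pq/n) for a sample proportion."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)
    return p - 3 * se, p + 3 * se

low, high = three_sigma_limits(50, 400)   # 50 bad eggs out of 400
print(round(100 * low, 1), round(100 * high, 1))   # limits in per cent
```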

7. In breeding certain stocks, 408 hairy and 126 glabrous
plants were obtained. If the expectation is one-fourth glabrous, is
the divergence significant, or might it have occurred as a fluctuation
of sampling ?

Total number of plants = 408 + 126 = 534.

Expected number of glabrous plants = 1/4 × 534 = 133.5.
Observed number of glabrous plants = 126.

Difference between expected and observed number of glabrous
plants = 133.5 - 126 = 7.5.

Here p = 1/4, q = 3/4 and n = 534.

Hence the standard error of the number of glabrous plants

$= \sqrt{534 \times \frac{1}{4} \times \frac{3}{4}} = 10.0$ approx.

∴ z = 7.5/10.0 = 0.75,

which shows that the difference is very insignificant and might
have occurred as a fluctuation of sampling.

8. Balls were drawn from a bag containing equal numbers of
black and white balls, each ball being returned before drawing
another. The records were then grouped by counting the number
of black balls in consecutive 2's, 3's, 4's, 5's etc. The following
are the distributions so derived for grouping by 5's, 6's and 7's.
Compare actual with theoretical means and standard deviations.

Successes    (a) Grouping    (b) Grouping    (c) Grouping
               by fives        by sixes        by sevens

    0             30              17               9
    1            125              65              34
    2            277             166             104
    3            224             192             151
    4            136             166             148
    5             27              69              95
    6             —                8              40
    7             —               —                4

  Total          819             683             585

Since there are equal numbers of black and white balls, the
chance of drawing a black ball is 1/2, so that we have

p = 1/2 and q = 1/2.

Hence the theoretical distribution for group (a) is given by

$819 \left(\frac{1}{2} + \frac{1}{2}\right)^5$.

Theoretical mean $= np = 5 \times \frac{1}{2} = 2.5$ and theoretical standard
deviation $= \sqrt{5 \times \frac{1}{2} \times \frac{1}{2}} = 1.118$.

Actual mean and standard deviation are found as follows :—

Successes     f      d = x - 2      fd       fd²
   (x)

    0         30        -2         -60      120
    1        125        -1        -125      125
    2        277         0           0        0
    3        224         1         224      224
    4        136         2         272      544
    5         27         3          81      243

  Total      819                   392     1256

Assumed origin A = 2.

∴ $M = A + \frac{1}{N}\sum fd = 2 + \frac{392}{819} = 2 + 0.48 = 2.48$,

$\sigma = \sqrt{\frac{1256}{819} - (0.48)^2} = 1.14$.

Similarly theoretical and actual means and standard deviations
for groups (b) and (c) can be found. This is left as an exercise for
the students.

For group (b), Theo. M = 3, σ = 1.225
and Actual M = 2.97, σ = 1.26.

For group (c), Theo. M = 3.5, σ = 1.323
and Actual M = 3.47, σ = 1.40.
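The comparison for all three groupings can be carried out directly from the frequency table of this example. The sketch below works from the origin 0 rather than an assumed origin; the function and variable names are chosen here for illustration.

```python
from math import sqrt

def mean_and_sd(freq):
    """freq[k] is the number of groups showing k successes."""
    total = sum(freq)
    mean = sum(k * f for k, f in enumerate(freq)) / total
    mean_sq = sum(k * k * f for k, f in enumerate(freq)) / total
    return mean, sqrt(mean_sq - mean ** 2)

# Observed frequencies of 0, 1, 2, ... black balls for groups of 5, 6 and 7.
groups = {
    5: [30, 125, 277, 224, 136, 27],
    6: [17, 65, 166, 192, 166, 69, 8],
    7: [9, 34, 104, 151, 148, 95, 40, 4],
}

for n, freq in groups.items():
    actual_mean, actual_sd = mean_and_sd(freq)
    theo_mean, theo_sd = n * 0.5, sqrt(n * 0.5 * 0.5)   # np and sqrt(npq)
    print(n, theo_mean, round(theo_sd, 3), round(actual_mean, 2), round(actual_sd, 2))
```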

9. Given that the probability that men of 25 will die before they are
45 is 0.147, and that among 1,000 members of a certain profession
who are 25 years of age 200 die before they reach 45, what is the
probability of finding this mortality rate in a random sample of the
whole population, and hence with what confidence can we say that
these men differ significantly from the rest of the population in
mortality rate between ages 25 and 45 ?

Here p = 0.147 and q = 0.853.

Expected number = 1000 × 0.147 = 147
and observed number = 200.

The difference between the observed and expected numbers

= 200 - 147 = 53.

The standard deviation of simple sampling

$= \sqrt{npq} = \sqrt{1000 \times 0.147 \times 0.853}$

= 11.2 approx.

∴ z = 53/11.2 = 4.7 approx.

Hence the difference between the observed and expected
frequencies is highly significant. A deviation of this magnitude
(4.7σ) occurs so rarely, about once in a million (from the tables),
that we can with confidence say that this population differs
significantly from the rest.
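The "once in a million" figure can be checked against the normal curve instead of printed tables. The sketch assumes that the figure refers to the upper tail only (an excess of deaths); that reading is an assumption made here.

```python
from math import sqrt
from statistics import NormalDist

n, p, observed = 1000, 0.147, 200

se = sqrt(n * p * (1 - p))        # standard deviation of simple sampling
z = (observed - n * p) / se

# Chance of an excess at least this large under the normal approximation.
upper_tail = NormalDist().cdf(-z)

print(round(se, 1), round(z, 1), upper_tail)
```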

10. Out of 200 individuals 40 per cent show a certain trait, and
the number expected on a certain theory is 50 per cent; find
whether the number observed differs significantly from expectation.

Here p = 0.5 = 1/2 and q = 1 - 1/2 = 1/2, n = 200.

The standard error of p

$= \sqrt{pq/n} = \sqrt{\frac{1}{2} \times \frac{1}{2} \times \frac{1}{200}} = 0.0354$.

∴ $z = \frac{0.5 - 0.4}{0.0354} = 2.83$.

Hence the difference is significant.

11. In a newspaper article of 1,600 words in English 36 per cent
of the words are found to be of Anglo-Saxon origin. Assuming
that simple sampling conditions hold, estimate the proportion of
Anglo-Saxon words in the writer's vocabulary and assign limits to
that proportion. Suggest possible causes which might break down
the three conditions for simple sampling.

Here p = 36/100 and q = 64/100.

The standard error of p $= \sqrt{\frac{36}{100} \times \frac{64}{100} \times \frac{1}{1600}} = 0.012$

= 1.2 per cent Anglo-Saxon.

Whatever may be the percentage of words of Anglo-Saxon
origin in the writer’s vocabulary, a simple sample should give a
percentage within three times this standard error.

Hence taking 36/100 (i.e. 36%) to be the estimate of words of
Anglo-Saxon origin in the writer’s vocabulary, we have that the
limits are (36 ± 3 × 1.2) per cent, i.e. 39.6 per cent and 32.4 per cent
approximately.

The sampling is almost certainly not simple. The possible
causes may be:

(1) Nature of subject matter might require words of a certain
type, e.g. scientific words probably would not be Anglo-Saxon.

(2) The occurrence of one word influences the occurrence of
the next.

12. Given that, on the average, 4% of insured men of age 65
die within a year, and that 60 of a particular group of 1000 such
men died within a year, show that this group cannot be regarded
as a representative sample, seeing that the actual deviation of the
proportion of deaths is more than three times the S. E. of the
proportion for samples of this size.

Expected value = 4 per cent.

Observed value = 60/1000 = 6 per cent.

Difference = (6 - 4) = 2 per cent.

Standard error of the proportion of deaths

$= \sqrt{\frac{1}{25} \times \frac{24}{25} \times \frac{1}{1000}} = \frac{1.55}{250}$

$= \frac{1.55}{250} \times 100$ per cent
= 0.62 per cent.

Since 3 × 0.62 = 1.86 per cent is less than 2 per cent, the
observed difference is more than three times the
standard error of the proportion for samples of this size.

13. Experience has shown that 20 per cent of a manufactured
product is of the top quality. In one day's production of 400 articles
only 50 are of top quality. Does this contradict our hypothesis of
20 per cent ? [You are given that if X is normal with mean $\mu$ and
variance $\sigma^2$, then $P\{|X - \mu| \ge 1.96\sigma\} = 0.05$.]

Here p = 20/100 = 1/5, q = 1 - 1/5 = 4/5; observed proportion = 50/400 = 1/8.

Difference between the expected and observed proportions

= 1/5 - 1/8 = 3/40 = 0.075.

Standard error of p $= \sqrt{pq/n} = \sqrt{\frac{1}{5} \times \frac{4}{5} \times \frac{1}{400}} = 0.02$.

∴ z = 0.075/0.02 = 3.75 > 1.96.

Hence the difference is significant and the hypothesis of
20 per cent is contradicted.

14. In a locality containing 18000 families, a sample of 840
families was selected at random. Of these 840 families, 206 families
were found to have a monthly income of Rs. 50 or less. It is desired
to estimate how many out of the 18000 families have a monthly
income of Rs. 50 or less. Within what limits would you place your
estimate ? (U. P. C. S. 1948)

Here p = 206/840 and q = 634/840.

∴ The standard error of the proportion of families having
a monthly income of Rs. 50 or less

$= \sqrt{\frac{206}{840} \times \frac{634}{840} \times \frac{1}{840}}$

= 0.015 = 1.5 per cent.

Hence taking 206/840 (or 24.5%) to be the estimate of families
having a monthly income of Rs. 50 or less in the locality, we have
that the limits are (24.5 ± 3 × 1.5) per cent, i.e. 20 per cent and
29 per cent approximately.

15. A dealer takes a sample of 100 items from a consignment of 10000
items of a certain goods and finds that there are 50 items of grade I
worth Rs. 5 per thousand, 30 items of grade II worth Rs. 4 per
thousand, and twenty items of grade III worth Rs. 3 per thousand.
Within what limits should the value of the consignment be fixed ?

For items of grade I, we have

p = 50/100 = 1/2 = 0.5 and q = 1 - 1/2 = 1/2 = 0.5.

For items of grade II, we have

p = 30/100 = 0.3 and q = 0.7.

And for items of grade III, we get

p = 20/100 = 0.2 and q = 0.8.

Total number of items in the sample = 100.

Hence the standard error of simple sampling is
for grade I $= \sqrt{0.5 \times 0.5 \times 100} = 5.0$,
for grade II $= \sqrt{0.3 \times 0.7 \times 100} = 4.6$
and for grade III $= \sqrt{0.2 \times 0.8 \times 100} = 4.0$.

Since the sample contains 100 items, these standard errors of the
numbers are also the standard errors of the percentages. Hence we
have the lower and upper limits for the percentages
of the three grades as follows:

Limits    Grade I               Grade II                 Grade III

Lower     50 - 3 × 5 = 35%      30 - 3 × 4.6 = 16.2%     20 - 3 × 4 = 8%
Upper     50 + 3 × 5 = 65%      30 + 3 × 4.6 = 43.8%     20 + 3 × 4 = 32%


Now the highest value that can be placed upon the consignment
is that value for which grade I is the highest and grade III
is the lowest, so that we get

grade I = 65%

and grade III = 8%.

Then grade II = {100 - (65 + 8)} = 27%.

The whole consignment of 10000 items would be worth Rs. 50, Rs. 40
or Rs. 30 if it were entirely of grade I, II or III respectively.
Hence the highest value of the consignment

= 65% of Rs. 50 + 27% of Rs. 40 + 8% of Rs. 30
= Rs. 32.5 + Rs. 10.8 + Rs. 2.4
= Rs. 45.7.

Similarly, the lowest value that can be given to the consignment
is that value for which grade I is the lowest and grade III
the highest, i.e.

grade I = 35%,

grade III = 32%,
so that grade II = {100 - (35 + 32)} = 33%.

Hence the least value of the consignment

= 35% of Rs. 50 + 33% of Rs. 40 + 32% of Rs. 30
= Rs. 17.5 + Rs. 13.2 + Rs. 9.6
= Rs. 40.3.

Thus the value of the consignment almost certainly lies within
the limits of Rs. 40.3 and Rs. 45.7.
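The valuation argument of this example can be followed step by step in code. The sketch below is illustrative: the names are invented here, and it uses the unrounded standard errors, which for grades I and III agree with the rounded ones used above.

```python
from math import sqrt

counts = {"I": 50, "II": 30, "III": 20}   # items of each grade in the sample
price = {"I": 5, "II": 4, "III": 3}       # Rs. per thousand items
n = sum(counts.values())
thousands = 10000 / 1000                  # size of the consignment in thousands

def limits_percent(count):
    """Percentage in the sample minus and plus three standard errors."""
    p = count / n
    se = 100 * sqrt(p * (1 - p) / n)
    return 100 * p - 3 * se, 100 * p + 3 * se

def value(share):
    """Value in Rs. of the consignment for given percentage shares."""
    return sum(share[g] / 100 * thousands * price[g] for g in share)

low_I, high_I = limits_percent(counts["I"])
low_III, high_III = limits_percent(counts["III"])

# Highest value: most of grade I, least of grade III, remainder grade II.
highest = value({"I": high_I, "III": low_III, "II": 100 - high_I - low_III})
# Lowest value: least of grade I, most of grade III, remainder grade II.
lowest = value({"I": low_I, "III": high_III, "II": 100 - low_I - high_III})

print(round(lowest, 1), round(highest, 1))
```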

Exercises 

1. A coin is tossed 400 times and it turns up head 216 times.
Discuss whether the coin may be an unbiased one, and
explain briefly the theoretical principles you would use for
this purpose. (Agra 1949)

[Ans. z = 1.6 and so the coin can be said to be an unbiased
one.]

2. A biased penny is tossed 100 times and comes down heads
70 times. What are the probable limits to the probability of
getting a head in a single trial ?

[Ans. p lies between 0.55 and 0.85.]

3. In tossing a hundred pennies a student gets 66 heads. Do
you think that he has used sufficient care to obtain a random
toss each time ?

[Ans. z = 3.2 and so the toss cannot be said to be random.]

4. A certain cubical die was thrown 9,000 times, and a 5 or 6
was obtained 3,240 times. On the assumption of random
throwing, do the data indicate an unbiased die ?

[Ans. Die is biased.]

5. Twelve dice are thrown 3086 times and a throw of a 2, 3 or 4
is reckoned as a success. Suppose that 19142 throws of a
2, 3 or 4 have been made out. Do you think that this
observed value deviates from the expected value ? If so, can
the deviation from the expected value be due to fluctuations
of simple sampling ?

[Ans. z = 6.5 and so the deviation is most unlikely to have
arisen due to fluctuations of simple sampling.]

6. In a sample of 100 in a district, 60 are found to be wheat-eaters
and 40 rice-eaters. Can we assume that both the food-articles
are equally popular ?

[Ans. z = 2 and so the difference may be due to fluctuations
of simple sampling, i.e. the two food articles may be
considered as equally popular.]

7. Balls are drawn from a bag containing equal numbers of
black and white balls, each ball being returned before drawing
another. In 2250 drawings, 1018 black and 1232 white balls
have been drawn. Do you suspect some bias on the part of
the drawer ?

[Ans. z = 4.5 so that it seems reasonable to suspect that the
drawer was biased.]

8. A man buys 1000 sacks of potatoes. He finds that from
1,000 potatoes chosen from the sacks at random, 400 are of
class A, worth Rs. 10 a sack; 250 are of class B, worth Rs. 7
per sack; 200 are of class C, worth Rs. 5 a sack; and 150
are of class D, worth Rs. 4 per sack. What are the upper
and lower bounds for the value of the potatoes ?

[Hint. Find the upper and lower values for the four classes
of potatoes by adding and subtracting three times the
standard error of the proportions from actual proportions
and express it in percentages. Then the maximum value of
the potatoes is the value for which class A has the maximum
value and classes C and D the minimum value. This gives
on calculation Rs. 7667.10.

Similarly the minimum value of the potatoes is that value
for which classes C and D have the maximum values and
class A the minimum. This gives Rs. 7032.9. Thus the
upper and lower bounds for the value of the potatoes are
Rs. 7667.10 and Rs. 7032.9.]

9. Balls are drawn from a bag containing equal numbers of
black and white balls, each ball being returned before
drawing another. Out of 4,096 drawings, 2,030 balls were
black and 2,066 white. Is this divergence probably significant
of bias ?

[Ans. z = 18/32 = 0.56 nearly, so that the difference is not significant.]

14.9. Comparison of Large Samples.

Two samples from distinct materials or different populations
give proportions of A's as $p_1$ and $p_2$, the numbers of observations
in the samples being $n_1$ and $n_2$ respectively.

Two cases arise :

(a) If the two populations are really similar as regards
the proportion of A's, can the difference between the two
proportions have arisen merely as a fluctuation of simple sampling ?
In this case, we have no theoretical expectation as to the
proportion of A's in the populations from which either sample has
been drawn. The best guide in such cases would be to take the
mean proportion in the two samples together for the proportion
of A's in the populations. Hence the proportion $p_0$ of A's in the
population is estimated by

\[ p_0 = \frac{p_1 n_1 + p_2 n_2}{n_1 + n_2}, \qquad q_0 = 1 - p_0 . \]

Let $E_1$ and $E_2$ be the standard errors in the two samples; then

\[ E_1^2 = \frac{p_0 q_0}{n_1} \quad \text{and} \quad E_2^2 = \frac{p_0 q_0}{n_2} . \]

If $E$ be the standard error of the difference between $p_1$ and $p_2$, then

\[ E^2 = E_1^2 + E_2^2 = p_0 q_0 \left( \frac{1}{n_1} + \frac{1}{n_2} \right). \qquad ...(1) \]

Let

\[ z = \frac{|p_1 - p_2|}{E} . \]

If z > 3, the difference between $p_1$ and $p_2$ is a real one and
is not merely due to fluctuations of simple sampling. If z < 3,
the difference may be due to fluctuations of simple sampling.

(b) If the proportions of A's are not the same in the two
populations from which the samples are drawn, but $p_1$ and $p_2$ are
the true values of the proportions, the standard error $E$ of the
difference in this case is given by

\[ E^2 = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}, \qquad ...(2) \]

where $q_1 = 1 - p_1$ and $q_2 = 1 - p_2$.

If $|p_1 - p_2| / E < 3$, the difference might have arisen due to
fluctuations of simple sampling only and may vanish on taking
fresh samples in the same way from the same material.
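
The two standard errors of this section can be sketched in a few lines of Python. This is only an illustration: the function names `pooled_se`, `unpooled_se` and `z_ratio` and the sample figures are assumptions introduced here, and the limit 3 is the working rule stated above.

```python
import math

def pooled_se(p1, n1, p2, n2):
    """Case (a): standard error of p1 - p2 using the pooled proportion p0."""
    p0 = (p1 * n1 + p2 * n2) / (n1 + n2)
    q0 = 1.0 - p0
    return math.sqrt(p0 * q0 * (1.0 / n1 + 1.0 / n2))

def unpooled_se(p1, n1, p2, n2):
    """Case (b): standard error of p1 - p2 when p1 and p2 are the true proportions."""
    return math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)

def z_ratio(p1, n1, p2, n2, pooled=True):
    """Ratio of the observed difference |p1 - p2| to its standard error."""
    se = pooled_se(p1, n1, p2, n2) if pooled else unpooled_se(p1, n1, p2, n2)
    return abs(p1 - p2) / se

# Illustrative figures only (hypothetical, not taken from any survey).
z = z_ratio(0.30, 1200, 0.25, 900)
verdict = "real difference" if z > 3 else "may be a sampling fluctuation"
print("z =", round(z, 2), "->", verdict)
```

The flag `pooled` merely selects between formulae (1) and (2); which of the two is appropriate depends on the hypothesis being examined, as explained in cases (a) and (b).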

14.10. Solved Examples.

1. In a large city A, 20 per cent of a random sample of 900
schoolboys had a certain slight physical defect. In another large
city B, 18.5 per cent of a random sample of 1,600 schoolboys had
the same defect. Is the difference between the proportions
significant ? [Agra M. Sc. (Maths.) 1958]

Here

\[ p_1 = \frac{20}{100} = \frac{1}{5}, \quad n_1 = 900 \]

and

\[ p_2 = \frac{18.5}{100} = \frac{37}{200}, \quad n_2 = 1600 . \]

Hence

\[ p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{180 + 296}{900 + 1600} = 0.19, \qquad q_0 = 1 - 0.19 = 0.81 . \]

Now

\[ E^2 = p_0 q_0 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) = 0.19 \times 0.81 \left( \frac{1}{900} + \frac{1}{1600} \right) = 0.1539 \times 0.0017 = 0.00027 \text{ nearly,} \]

so that E = 0.0163 approximately.

Again

\[ p_1 - p_2 = \frac{1.5}{100} = 0.015 . \]

Then

\[ z = \frac{p_1 - p_2}{E} = \frac{0.015}{0.0163} = 0.92 . \]

Since z < 1, the difference between the proportions is not
significant and might vanish on taking fresh samples.

2. In a random sample of 500 persons from town A, 200 are
found to be consumers of cheese. In a sample of 400 from town B,
200 are found to be consumers of cheese. Discuss the question
whether the data reveal a significant difference between A and B so
far as the proportion of cheese consumers is concerned.

Here

\[ p_1 = \frac{200}{500} = \frac{2}{5}, \qquad p_2 = \frac{200}{400} = \frac{1}{2}, \]

so that

\[ p_0 = \frac{p_1 n_1 + p_2 n_2}{n_1 + n_2} = \frac{200 + 200}{500 + 400} = \frac{4}{9} \]

and

\[ q_0 = 1 - \frac{4}{9} = \frac{5}{9} . \]

Hence

\[ E^2 = \frac{4}{9} \cdot \frac{5}{9} \left( \frac{1}{500} + \frac{1}{400} \right) = 0.001111 . \]

This gives E = 0.033.

Now

\[ z = \frac{|p_1 - p_2|}{E} = \frac{0.5 - 0.4}{0.033} = 3.03 . \]

Since z > 3, the difference is significant.

3. Out of 2,000 men of age 25 employed in a certain trade
(A) 400 die before they are 50, whereas out of 1,000 men of age 25
in another trade (B), only 175 die. What is the probability of
occurrence of this difference in random sampling ?

Here

\[ p_1 = \frac{400}{2000} = 0.2, \qquad p_2 = \frac{175}{1000} = 0.175, \]

\[ p_0 = \frac{p_1 n_1 + p_2 n_2}{n_1 + n_2} = \frac{400 + 175}{2000 + 1000} = 0.192 \text{ nearly,} \qquad q_0 = 0.808, \]

\[ E^2 = p_0 q_0 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) = 0.192 \times 0.808 \left( \frac{1}{2000} + \frac{1}{1000} \right). \]

This gives E = 0.0152.

The difference in the death rates

\[ = p_1 - p_2 = 0.2 - 0.175 = 0.025 . \]

\[ \therefore \quad z = \frac{p_1 - p_2}{E} = \frac{0.025}{0.0152} = 1.64 \text{ nearly.} \]

From the tables for the areas of the normal curve, we find
that the probability of reaching or exceeding this value of z by
mere chance is 0.05, i.e. 1 in 20.
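
The tabular value can be checked without tables. The sketch below assumes only Python's standard `math.erfc`; the function name `upper_tail` is introduced here for illustration.

```python
import math

def upper_tail(z):
    """P(Z >= z) for a standard normal variate Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Value of z from the death-rate example: 0.025 / 0.0152.
z = 0.025 / 0.0152
print(round(upper_tail(z), 3))
```

This is the area of one tail only, i.e. the chance of a deviation at least as large in the observed direction.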


4. In two large populations there are 30 and 25 per cent
respectively of fair-haired people. Is this difference likely to be hidden
in samples of 1200 and 900 respectively from the two populations ?

Here

\[ p_1 = \frac{30}{100} = 0.30, \qquad p_2 = \frac{25}{100} = 0.25, \]

so that

\[ p_1 - p_2 = 0.05, \]

\[ E^2 = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2} = \frac{0.30 \times 0.70}{1200} + \frac{0.25 \times 0.75}{900} . \]

This gives on calculation, E = 0.0195.

\[ z = \frac{p_1 - p_2}{E} = \frac{0.05}{0.0195} = 2.56 \text{ nearly.} \]

Hence it is unlikely that the real difference will be hidden.

5. In a random sample of 800 adults from the population of
a certain large city, 600 are found to have dark hair. In a random
sample of 1000 adults from the inhabitants of another large city,
700 are dark-haired. Show that the difference of the proportions
of dark-haired people is nearly 2.4 times the standard error of
this difference for samples of the above sizes.

Here

\[ p_1 = \frac{600}{800} = \frac{3}{4}, \qquad p_2 = \frac{700}{1000} = \frac{7}{10}, \]

\[ p_0 = \frac{p_1 n_1 + p_2 n_2}{n_1 + n_2} = \frac{600 + 700}{800 + 1000} = \frac{13}{18}, \qquad q_0 = 1 - \frac{13}{18} = \frac{5}{18}, \]

\[ p_1 - p_2 = \frac{3}{4} - \frac{7}{10} = \frac{1}{20} = 0.05, \]

\[ E = \sqrt{ \frac{13}{18} \cdot \frac{5}{18} \left( \frac{1}{800} + \frac{1}{1000} \right) } = \frac{1}{360} \sqrt{\frac{117}{2}} = \frac{7.65}{360} = 0.0213 . \]

\[ \therefore \quad \frac{p_1 - p_2}{E} = \frac{0.05}{0.0213} = 2.4 \text{ approx.} \]


6. In a random sample of 500 men from a particular district
of U. P., 300 are found to be smokers. In one of 1,000 men from
another district, 550 are smokers. Do the data indicate that the
two districts are significantly different with respect to the
prevalence of smoking among men ? (P. C. S. 1953)

Here

\[ p_1 = \frac{300}{500} = \frac{3}{5}, \qquad p_2 = \frac{550}{1000} = \frac{11}{20}, \]

\[ p_0 = \frac{p_1 n_1 + p_2 n_2}{n_1 + n_2} = \frac{300 + 550}{500 + 1000} = \frac{17}{30}, \qquad q_0 = 1 - \frac{17}{30} = \frac{13}{30}, \]

\[ E = \sqrt{ \frac{17}{30} \cdot \frac{13}{30} \left( \frac{1}{500} + \frac{1}{1000} \right) } = 0.0271, \]

\[ p_1 - p_2 = 0.6 - 0.55 = 0.05 . \]

\[ \therefore \quad z = \frac{p_1 - p_2}{E} = \frac{0.05}{0.0271} = 1.85 \text{ approx.} \]

Hence the difference is not significant, i.e. the data do not
indicate that the two districts are significantly different with
respect to the prevalence of smoking among men.

7. The subject under investigation is the measure of dependence
of Tamil on words of Sanskrit origin. One newspaper article
reporting the proceedings of the Constituent Assembly contained
2,025 words of which 729 words were declared by a literary critic
to be of Sanskrit origin. A second article by the same author
describing atomic research contained 1,600 words of which 640
words were declared by the same critic to be of Sanskrit origin.
Assuming that simple sampling conditions held, estimate the limits
for the proportion of Sanskrit terms in the writer's vocabulary,
and examine whether there is any significant difference in the
dependence of this writer on words of Sanskrit origin in writing
on the two subjects. (I. A. S. 1947)

If $p_0$ denote the proportion of Sanskrit terms in the writer's
vocabulary in both the articles taken together, then

\[ p_0 = \frac{729 + 640}{2025 + 1600} = \frac{1369}{3625} = 0.3777 = 37.77\% \]

and $q_0$ = 0.6223 = 62.23%.

Now the proportion of Sanskrit terms in the first article
= 729/2025 = 0.36 = 36% and the proportion in the second article
= 640/1600 = 0.40 = 40%.

The difference = 40 - 36 = 4%.

The standard error of the difference between these two
proportions is given by

\[ E = \sqrt{ p_0 q_0 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) } = \sqrt{ 37.77 \times 62.23 \left( \frac{1}{2025} + \frac{1}{1600} \right) } \text{ per cent} = 1.62 \text{ per cent.} \]

\[ \therefore \quad z = \frac{4}{1.62} = 2.47 \text{ nearly.} \]

Since z < 3, the difference may have arisen from fluctuations
of simple sampling. Hence, on the criterion used in this chapter,
there is no significant difference in the dependence of the writer
on words of Sanskrit origin in writing on the two given subjects.


To estimate the limits for the proportion of Sanskrit terms
in the writer's vocabulary we first find the standard error of
the proportion of Sanskrit terms in the writer's vocabulary.
This is given by

\[ \sqrt{\frac{p_0 q_0}{n}}, \quad \text{where } n = 2025 + 1600 = 3625, \]

\[ = \sqrt{\frac{37.77 \times 62.23}{3625}} = 0.81 \text{ per cent.} \]

Hence taking $p_0$ = 37.77 per cent to be the estimate of the
Sanskrit words in the writer's vocabulary, we have that the limits
are (37.77 ± 3 × 0.81) per cent, i.e. 35.34% and 40.20%
approximately.
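
The limits obtained by adding and subtracting three times the standard error can be sketched as follows. The function name `three_sigma_limits` and its parameter `k` are illustrative assumptions; the counts are those of the two articles taken together.

```python
import math

def three_sigma_limits(successes, n, k=3.0):
    """Limits p -/+ k * sqrt(pq/n) for an observed proportion p = successes / n."""
    p = successes / n
    se = math.sqrt(p * (1.0 - p) / n)
    return p - k * se, p + k * se

low, high = three_sigma_limits(729 + 640, 2025 + 1600)
print(round(100 * low, 2), round(100 * high, 2))  # limits expressed in per cent
```

Small differences in the second decimal place from a hand calculation are to be expected, since the standard error is not rounded here before multiplying by three.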


8. If for one half of n events the chance of success is p and
the chance of failure is q, whilst for the other half the chance of
success is q and the chance of failure is p, show that the standard
deviation of the number of successes is the same as if the chance of
success were p in all the cases, i.e. $\sqrt{npq}$, but that the mean of
the number of successes is $\frac{n}{2}$ and not np.

Let $\sigma_1$ and $\sigma_2$ denote the standard deviations of the number
of successes in the first and second halves of the n events.

Then $\sigma_1^2 = \tfrac{1}{2} npq$ and $\sigma_2^2 = \tfrac{1}{2} nqp$.

Hence the standard deviation of the number of successes

\[ = \sqrt{\sigma_1^2 + \sigma_2^2} = \sqrt{\tfrac{1}{2} npq + \tfrac{1}{2} npq} = \sqrt{npq} . \]

If $p_0$ denotes the proportion of successes in the n events, then

\[ p_0 = \frac{\frac{n}{2} p + \frac{n}{2} q}{\frac{n}{2} + \frac{n}{2}} = \frac{p + q}{2} = \frac{1}{2} . \]

Hence the mean of the number of successes $= n p_0 = \tfrac{1}{2} n$.
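
The result may be verified exactly for particular values by building the distribution of the total number of successes as the sum of two independent binomial variates. The sketch below is an illustration only; the helper names and the values n = 10, p = 1/3 are assumptions chosen for the check.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(m, p):
    """Probabilities of 0, 1, ..., m successes in m trials with chance p."""
    q = 1 - p
    return [comb(m, k) * p**k * q**(m - k) for k in range(m + 1)]

def convolve(a, b):
    """Distribution of the sum of two independent counts with distributions a and b."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def mean_and_variance(pmf):
    mean = sum(k * w for k, w in enumerate(pmf))
    var = sum((k - mean) ** 2 * w for k, w in enumerate(pmf))
    return mean, var

n, p = 10, Fraction(1, 3)          # hypothetical values for the check
q = 1 - p
total = convolve(binomial_pmf(n // 2, p), binomial_pmf(n // 2, q))
mean, var = mean_and_variance(total)
print(mean, var, n * p * q)        # mean is n/2, variance equals npq
```

Exact fractions are used so that the equality of the variance with npq is not obscured by rounding.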


9. In a certain association table, the following frequencies
were obtained :

\[ (AB) = 309, \quad (A\beta) = 214, \quad (\alpha B) = 132, \quad (\alpha\beta) = 119 . \]

Can the association of the table have arisen as a fluctuation of
simple sampling, the true association being zero ?

We have

\[ (A) = (AB) + (A\beta) = 309 + 214 = 523, \]
\[ (B) = (AB) + (\alpha B) = 309 + 132 = 441 = n_1, \]
\[ (\beta) = (A\beta) + (\alpha\beta) = 214 + 119 = 333 = n_2, \]
\[ N = (B) + (\beta) = 441 + 333 = 774 . \]

Proportion of A's in B's $= \frac{(AB)}{(B)} = \frac{309}{441} = 0.701$.

Proportion of A's in $\beta$'s $= \frac{(A\beta)}{(\beta)} = \frac{214}{333} = 0.643$.

The difference of the two proportions = 0.701 - 0.643 = 0.058.

The proportion of A's in the universe

\[ = \frac{(A)}{N} = \frac{523}{774} = 0.676 = p_0, \text{ say.} \]

Then $q_0$ = 1 - 0.676 = 0.324.

Hence the standard error of the difference between the two
proportions is given by

\[ E^2 = 0.676 \times 0.324 \left( \frac{1}{441} + \frac{1}{333} \right), \]

whence E = 0.034.

\[ \therefore \quad z = \frac{\text{difference}}{E} = \frac{0.058}{0.034} = 1.7 < 3 . \]

Hence the association between A and B is not a real one
and might have arisen as a fluctuation of simple sampling.

10. If a series of random samples of different sizes is taken
from the same material, show that the standard deviation of the
observed proportion of successes in such sets is s, where

\[ s^2 = \frac{pq}{H} \]

and H is the harmonic mean of the numbers in the samples.

Let there be $f_1$ samples of $n_1$ individuals each, $f_2$ samples of
$n_2$ individuals each, $f_3$ of $n_3$ and so on. Let p be the chance of
success and q that of failure. The variance of the observed
proportion of successes in a sample of $n_1$ individuals is $pq/n_1$, so
that the expected sum of the squared deviations of the observed
proportions in the $f_1$ samples of $n_1$ individuals each

\[ = \frac{pq}{n_1} + \frac{pq}{n_1} + ... \text{ to } f_1 \text{ terms} = \frac{f_1\, pq}{n_1}, \]

and similarly for the other sets.

Hence the standard deviation s of the observed proportion
of successes in all the sets is given by

\[ N s^2 = pq \left[ \frac{f_1}{n_1} + \frac{f_2}{n_2} + \frac{f_3}{n_3} + ... \right], \qquad ...(1) \]

where N is the total number of samples.

Now the harmonic mean H of the numbers in the samples
is given by

\[ \frac{1}{H} = \frac{1}{N} \left[ \frac{f_1}{n_1} + \frac{f_2}{n_2} + \frac{f_3}{n_3} + ... \right]. \]

Then (1) gives

\[ s^2 = \frac{pq}{H}, \]

as was required.
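
A small numerical check of this identity is sketched below, with exact fractions. The sample sizes and the value of p are hypothetical, and a size that occurs f times is simply repeated f times in the list.

```python
from fractions import Fraction

def harmonic_mean(sizes):
    """Harmonic mean H of the sample sizes: 1/H = (1/N) * sum of 1/n."""
    return Fraction(len(sizes)) / sum(Fraction(1, n) for n in sizes)

def mean_variance_of_proportion(sizes, p):
    """s^2: the average over all samples of the variance pq/n of the observed proportion."""
    q = 1 - p
    return sum(p * q / n for n in sizes) / len(sizes)

sizes = [10, 10, 20, 40]           # hypothetical: f = 2 samples of 10, one of 20, one of 40
p = Fraction(1, 4)
s2 = mean_variance_of_proportion(sizes, p)
print(s2, p * (1 - p) / harmonic_mean(sizes))   # the two values agree
```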


Exercises 



1. In a simple sample of 600 men from a certain large city,
400 are found to be smokers. In one of 900 from another
city 450 are found smokers. Do the data indicate that the
cities are significantly different with respect to the prevalence
of smoking among men ? (Agra B. Sc. 58)

[Ans. Here z = 6.56 nearly, so that the difference is
significant.]


2. In a random sample of 1000 persons from town A, 400 are
found to be consumers of rice. In a sample of 800 from
town B, 400 are found to be consumers of rice. Discuss the
question whether the data reveal a significant difference
between A and B so far as the proportion of rice-consumers
is concerned.

[Ans. z = 4.2, so that the difference is significant.]

3. One thousand articles from a factory are examined and 
found to be 3 per cent defective. Fifteen hundred similar 
articles from a second factory are found to be only 2 per cent 
defective. Can it reasonably be concluded that the product 
of the first factory is inferior to that of the second ? 

[Ans. z = 1.6, so that the difference is not significant and
hence we cannot reasonably say that the product of the first 
factory is inferior to that of the second.] 

4. In a town A, 19400 persons were observed and 27 per cent
of them were found to be short-sighted. In town B, 29750
persons were observed and 30 per cent were found to be
short-sighted. Can the difference observed in the percentage
of short-sighted persons be attributed solely to the
fluctuations of sampling ?

[Ans. z = 7.1 and so the difference is a real one.]

5. The sex ratio at birth is sometimes given by the ratio of male
to female births, instead of the proportion of male to total
births. If Z is the ratio, i.e. $Z = p/q$, show that the standard
error of Z is approximately $(1 + Z)\sqrt{Z/n}$, n being large,
so that deviations are small compared with the mean.



CHAPTER XV 

THE SAMPLING OF VARIABLES. 
LARGE SAMPLES.

15.1. In the previous chapter, we discussed the sampling of
attributes. We classified each member of a sample under one of 
two heads, success or failure. In the case of sampling of variables 
such a classification is no longer possible. Each individual 
member of the sample provides a value of the variable and these 
values are generally spread over a range, which may be limited 
or unlimited. It is often convenient to think of the variate 
as always having an infinite range. When the range is actually 
finite, we may take the frequency to be zero outside that range. 
The number of possible values of a continuous variable is infinite 
in any finite range, however small the interval may be. In other 
words, the population of variate values will always be infinite. 
Hence the drawing of a finite random sample does not affect the 
drawing of any other sample from the population of variate 
values and consequently the sampling is always simple. The 
examples for the sampling of variables are provided by statures 
of men, ages of persons at death, prices of a commodity etc. 

The aims of the study of the sampling of variables are the same
as those of the sampling of attributes. These are (a) to compare
the observed value with the expected value and to find how far 
the difference between the two values can be attributed to 
fluctuations of simple sampling, (b) to estimate the parameters 
of the parent population from the sample, such as mean of a 
variate, and (c) to see how reliable our estimates are when they 
are obtained. The problems set forth in (b) and (c) above are 
called the problem of estimation and the problem of testing of 
hypothesis. 

Statistical Estimation. It is the technique of estimating the
parameters of the population from those of a sample. Thus in
order to test the amount of dust in a bag of grain, a sample of
the grain is taken carefully from the middle portion of the bag,
weighed, and after cleaning it is weighed again so that the amount
of dust in the sample is found out. This weight of dust in the
sample multiplied by the ratio of the weight of grain in the bag
to the weight of the sample, gives an approximation to the weight
of dust in the whole bag.

Testing a Hypothesis. By testing a carefully drawn sample,
it is possible to verify a hypothesis regarding the population.
Thus to test the efficacy of a drug against a disease, we can find
from a sample the number of persons attacked even after
using the drug and the number attacked who have not used that
drug. A similar example is that of a rope manufacturer who
would like to adopt a new process if the strength of the ropes is
increased. From past experience he knows that the breaking
strength of the ropes is normally distributed with mean 100 lbs.-wt.
He will thus want to test the hypothesis that the new process
gives ropes with breaking strength distributed according to the normal
law with a mean more than 100 lbs.-wt. If this mean is less
than 100 lbs., the hypothesis is rejected.

Errors in testing hypothesis. The first type of error in testing
a hypothesis is that a correct hypothesis is rejected while the
second type of error lies in the fact of accepting a wrong
hypothesis. The statistical testing of hypothesis aims at limiting the
risk of the first type of error to a pre-assigned value, say 1% or
5%, and to minimize the second type of error.

Null Hypothesis. Suppose we are given a sample from which
a certain statistic such as the mean is calculated. We assume that
this sample is drawn from a population of known form for which
the corresponding parameter is tentatively specified. We call this
tentative specification a null hypothesis. For example, in a
coin-tossing experiment, we want to test whether the coin is biased.
Therefore the null hypothesis in this case is that the coin is
unbiased, i.e. p = 1/2, where p is the probability of a head (or tail). If
our experiment gives a value of the statistic which deviates
significantly from the value of the parameter (i.e. 1/2), the null
hypothesis is contradicted and we conclude that the coin is biased. If,
however, this deviation is not significant, the hypothesis is accepted
and the deviation may be attributed to sampling fluctuations.
As another example, in accepting a consignment of jute bags the
user wants to know whether the average warp strength of the
batch is 100 lbs. The suitable null hypothesis is that the mean
batch warp strength is 100 lbs. and a sampling experiment will
decide whether the hypothesis is to be rejected. Again in
sampling a certain drug for immunization against a disease, the
null hypothesis should be that the action of the drug and attack
of the disease are independent. Neither is it assumed that the
drug prevents the disease nor that it has bad effects and accelerates
the spread of the disease.

Thus a null hypothesis is a hypothesis which is tested for
possible rejection under the assumption that it is true. It is, so to
say, a "straw man" that we set up, possibly for the purpose of
knocking down.

By accepting a null hypothesis, we do not mean that it is
proved to be true. This only implies that on the basis of the statistic
calculated from the sample, we find no reason to question the
validity of the hypothesis. Nor does its rejection imply that it is
disproved. It simply means that so far as the given sample is
concerned, it does not seem to be a plausible hypothesis.

15.2. Sampling Distributions. If a large number of samples
are taken from a population and a statistic such as the 
mean or the standard deviation, is calculated for each sample, 
we shall get in general a series of different values, one for each 
sample. These values of the statistic under consideration may 
be grouped in a frequency distribution. If the number of samples 
becomes larger and larger, this distribution will tend to be normal. 
Such a distribution is called a sampling distribution. 
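
The idea may be illustrated by a small simulation. The sketch below is an assumption-laden example, not a prescribed procedure: the population is taken to be the six faces of a die, samples are drawn with replacement, and the function name `sample_means`, the sample size and the seed are all introduced here for illustration.

```python
import random
import statistics
from collections import Counter

def sample_means(population, n, num_samples, seed=0):
    """Means of `num_samples` random samples of size n drawn with replacement."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(num_samples)]

population = [1, 2, 3, 4, 5, 6]             # hypothetical population: faces of a die
means = sample_means(population, n=25, num_samples=2000)

# Group the values of the statistic into a frequency distribution (class width 0.2).
table = Counter(round(m * 5) / 5 for m in means)
for value in sorted(table):
    print(f"{value:4.1f}  {table[value]}")
```

The printed table is an empirical sampling distribution of the mean; the frequencies pile up around the population mean 3.5 and thin out on either side.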

15.3. Utility of Sampling Distribution. Let the sampling
distribution of a statistic be represented by the continuous curve

\[ y = f(x), \]

where x is the variate (statistic) and y is the corresponding
frequency. The total frequency of all the samples which give a value
of x greater than a given value $x_0$ will be given by the area to the
right of the ordinate at $x_0$. It follows that the probability that a
sample chosen at random from all possible samples will give a
value of x greater than $x_0$ is given by the area to the right of the
ordinate at $x_0$ divided by the total area of the curve.

If P denotes this probability, we have

\[ P = \int_{x_0}^{\infty} f(x)\,dx \Big/ \int_{-\infty}^{\infty} f(x)\,dx . \]

Similarly, the chance that a sample would give a value of x
lying between $x_1$ and $x_2$ is given by

\[ \int_{x_1}^{x_2} f(x)\,dx \Big/ \int_{-\infty}^{\infty} f(x)\,dx . \]

If the units are so chosen that the total area under the curve
is unity, these probabilities are given by

\[ \int_{x_0}^{\infty} f(x)\,dx \quad \text{and} \quad \int_{x_1}^{x_2} f(x)\,dx \quad \text{respectively.} \]
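
These ratios of areas can be approximated numerically. In the sketch below the curve is assumed, purely for illustration, to be the standard normal curve, the infinite limits are replaced by the finite cut-offs -10 and 10, and the trapezoidal rule stands in for exact integration; all the function names are hypothetical.

```python
import math

def normal_density(x):
    """Ordinate of the standard normal curve (an assumed example of f)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def area(f, a, b, steps=10000):
    """Trapezoidal approximation to the area under f between a and b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

def prob_greater(f, x0, lower=-10.0, upper=10.0):
    """Area to the right of x0 divided by the total area."""
    return area(f, x0, upper) / area(f, lower, upper)

def prob_between(f, x1, x2, lower=-10.0, upper=10.0):
    """Area between x1 and x2 divided by the total area."""
    return area(f, x1, x2) / area(f, lower, upper)

print(round(prob_greater(normal_density, 1.96), 4))
print(round(prob_between(normal_density, -1.0, 1.0), 4))
```

Because each probability is a ratio of areas, multiplying f by any constant leaves the result unchanged, which is why the total area may be taken as unity.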

Now if we take a sample and find that it gives a very low 

value of P, we are faced with three possibilities. 

(i) The hypothesis is not correct. 

(ii) The sampling is not simple. 

(iii) Some improbable event has happened. 

Generally we are led to suspect our hypothesis provided we
have tested our sampling technique or on other grounds have 
no reason to suspect it ; but it is always a matter of choice which 
of the above three explanations we should adopt under the given 
circumstances. 

15.4. Standard error. As in the case of sampling of attributes,
the term 'standard error' is used to denote the standard
deviation of the sampling distribution. In general, we are justified
in taking a range Mean ± 3σ as determining the limits outside
which the value of the parameter given by a sample probably
does not lie. Thus the standard error is used to gauge the
precision of an estimate and to pass judgments on the divergence
between expected and observed values. Hence it is necessary to
know the standard errors of various parameters which we have
to estimate. The most important of all the standard errors is
that of the mean and we proceed to find it in the following sections.

15.5. Expected values. We have already defined the expected
value of a random variable or any function of a random variable as
the average value of the function over all possible values of the
variable. Thus if x is a discrete random variable whose distribution
function is f(x), its expected value is given by $\sum x f(x)$ taken over
the whole range of x and is denoted by E(x).

In general we shall define the expected value of any function
of x, say $\phi(x)$, as

\[ E[\phi(x)] = \sum_{x} \phi(x) f(x), \]

where the sum is taken over the whole range of x. Similarly, the
expected value of any function $\phi(x_1, x_2, ..., x_n)$ of n discrete
variates $x_1, x_2, ..., x_n$ with distribution $f(x_1, x_2, ..., x_n)$ is
defined to be

\[ E[\phi(x_1, x_2, ..., x_n)] = \sum_{x_1} \sum_{x_2} ... \sum_{x_n} \phi(x_1, x_2, ..., x_n) f(x_1, x_2, ..., x_n) . \]

If x be a continuous variate having the distribution f(x), its
expected value is defined by

\[ E(x) = \int_{-\infty}^{\infty} x f(x)\,dx . \]

And the expected value of $\phi(x)$ is given by

\[ E[\phi(x)] = \int_{-\infty}^{\infty} \phi(x) f(x)\,dx . \]

Similarly the expected value of $\phi(x_1, x_2, ..., x_n)$ is given by

\[ E[\phi(x_1, x_2, ..., x_n)] = \int_{-\infty}^{\infty} ... \int_{-\infty}^{\infty} \phi(x_1, x_2, ..., x_n) f(x_1, x_2, ..., x_n)\,dx_1\,dx_2 ... dx_n . \]

Two simple properties of E are worth noting.

If c is a constant and if $\phi(x)$ and $\psi(x)$ are any functions of x,
then

\[ E[c\,\phi(x)] = c\,E[\phi(x)] \]

and

\[ E[\phi(x) + \psi(x)] = E[\phi(x)] + E[\psi(x)] . \]

These two relations follow from the corresponding relations
for the integrals :

\[ \int c\,\phi(x) f(x)\,dx = c \int \phi(x) f(x)\,dx, \]

\[ \int [\phi(x) + \psi(x)] f(x)\,dx = \int \phi(x) f(x)\,dx + \int \psi(x) f(x)\,dx . \]
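
For a discrete variate the definition and the two properties of E can be checked directly. The following sketch assumes, as an illustration only, that x is the number shown by a fair die; the function name `expectation` is introduced here.

```python
from fractions import Fraction

def expectation(phi, dist):
    """E[phi(x)] = sum of phi(x) * f(x); `dist` maps each value x to its probability f(x)."""
    return sum(phi(x) * fx for x, fx in dist.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}   # hypothetical example: a fair die

E_x = expectation(lambda x: x, die)              # E(x)
E_x2 = expectation(lambda x: x * x, die)         # E(x^2)

# E[c phi] = c E[phi]  and  E[phi + psi] = E[phi] + E[psi]
print(expectation(lambda x: 3 * x, die) == 3 * E_x)
print(expectation(lambda x: x + x * x, die) == E_x + E_x2)
```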

15.6. Moments. The moments of a distribution are the
expected values of the powers of the random variable which has
the given distribution. The nth moment of x is generally denoted by
$\mu'_n$ and is given by

\[ \mu'_n = E(x^n) = \int_{-\infty}^{\infty} x^n f(x)\,dx . \]

The first moment $\mu'_1$ is called the mean of x. The moments
about any arbitrary point a are defined as

\[ E[(x - a)^n] = \int_{-\infty}^{\infty} (x - a)^n f(x)\,dx . \]

If a is replaced by the mean $\mu$, we get the moments about
the mean, which are usually denoted by $\mu_n$. Thus

\[ \mu_n = \int_{-\infty}^{\infty} (x - \mu)^n f(x)\,dx . \]
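
The same definitions, with sums in place of integrals, can be tried on a discrete distribution. The sketch below is an illustration under that assumption; the distribution (a fair die) and the function names are hypothetical.

```python
from fractions import Fraction

def moment_about(dist, n, a=0):
    """nth moment about the point a: sum of (x - a)^n * f(x)."""
    return sum((x - a) ** n * fx for x, fx in dist.items())

def central_moment(dist, n):
    """nth moment about the mean."""
    mean = moment_about(dist, 1)
    return moment_about(dist, n, mean)

die = {x: Fraction(1, 6) for x in range(1, 7)}   # hypothetical example: a fair die

print(moment_about(die, 1))      # the mean
print(central_moment(die, 2))    # the variance
print(central_moment(die, 3))    # vanishes, the distribution being symmetrical
```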

15.7. Standard error of the sampling distribution of the means
of samples.

The mean value of the mean of all possible random samples of
size n from a population is the mean of the population and the
standard error is $\sigma / \sqrt{n}$, where $\sigma$ is the standard
deviation of the population. [Agra M. Sc. 1952, Agra B. Sc. 1955]

Suppose we have n statistically independent variables
$x_i$ (i = 1, 2, 3, ..., n) with the same probability distribution. Let us
consider the distribution of $\sum_{i=1}^{n} a_i x_i$, where $a_i$ is a constant.

Put

\[ X = a_1 x_1 + a_2 x_2 + ... + a_n x_n . \]

Then

\[ E(X) = E\left[ \sum_{i=1}^{n} a_i x_i \right] = \sum_{i=1}^{n} a_i E(x_i), \]

\[ \therefore \quad \bar{X} = \sum_{i=1}^{n} a_i \bar{x}_i , \]

where $\bar{x}_i$ represents the mean of the ith set and $\bar{X}$ the mean of X.

If we put $\frac{1}{n}$ for $a_i$ and all the sets $x_i$ have the same
distribution as the population, then

\[ E(x_1) = E(x_2) = ... = E(x_n) = \mu, \]

where $\mu$ is the mean of the population.

Also

\[ X = \frac{x_1 + x_2 + ... + x_n}{n} \]

= mean of a sample of n from a population with mean $\mu$.

Hence

\[ \bar{X} = \frac{1}{n} (\mu + \mu + ... \text{ to } n \text{ terms}) = \mu . \]

It follows that the mean value of the mean of all possible
random samples of size n from a population is the mean of the
population.

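For the variance of X, we have

\[ (X - \bar{X})^2 = \left[ \sum_{i=1}^{n} a_i (x_i - \bar{x}_i) \right]^2 = \sum_{i=1}^{n} a_i^2 (x_i - \bar{x}_i)^2 + 2 \sum_{i<j} a_i a_j (x_i - \bar{x}_i)(x_j - \bar{x}_j), \]

\[ \therefore \quad E(X - \bar{X})^2 = \sum_{i=1}^{n} a_i^2 E(x_i - \bar{x}_i)^2 + 2 \sum_{i<j} a_i a_j E[(x_i - \bar{x}_i)(x_j - \bar{x}_j)] \]

\[ = \sum_{i=1}^{n} a_i^2 \sigma_i^2 + 2 \sum_{i<j} a_i a_j \,\mathrm{cov}(x_i, x_j) \]

\[ = \sum_{i=1}^{n} a_i^2 \sigma_i^2 + 2 \sum_{i<j} a_i a_j \rho_{ij} \sigma_i \sigma_j , \]

where $\rho_{ij}$ is the coefficient of correlation between $x_i$ and $x_j$.
Since it is given that the variates are independent, $\rho_{ij} = 0$ for all
i and j (i ≠ j).

Hence

\[ \sigma_X^2 = \sum_{i=1}^{n} a_i^2 \sigma_i^2 = a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 + a_3^2 \sigma_3^2 + ... + a_n^2 \sigma_n^2 . \]

In particular, if $a_i = \frac{1}{n}$ for all i and all $x_i$ have the same
distribution, their mean and standard deviation shall be the same as
that of the population. In this case, we have

\[ \sigma_1^2 = \sigma_2^2 = ... = \sigma_n^2 = \sigma^2, \]

where $\sigma^2$ is the variance of the population.

Also

\[ X = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x} = \text{the mean of the sample of size } n . \]

Hence the variance of the means of samples of size n is
given by

\[ \sigma_{\bar{x}}^2 = \frac{1}{n^2} \sigma^2 + \frac{1}{n^2} \sigma^2 + \frac{1}{n^2} \sigma^2 + ... \text{ to } n \text{ terms} = \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}, \]

\[ \therefore \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} . \]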

Thus the standard deviation of the means of the samples of
size n varies inversely as the square root of the sample size. The
larger the sample size, the more closely the values of the sample
mean shall cluster to the mean of the population.
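
The two results can be confirmed exactly on a small scale by listing every possible sample. The sketch below assumes simple sampling, i.e. each member of the sample is drawn independently with every value of the population equally likely; the population [1, 2, 3, 6] and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def mean(values):
    return Fraction(sum(values), len(values))

def variance(values):
    m = mean(values)
    return sum((Fraction(v) - m) ** 2 for v in values) / len(values)

def all_sample_means(population, n):
    """Means of all ordered samples of size n, each member drawn independently."""
    return [mean(sample) for sample in product(population, repeat=n)]

population = [1, 2, 3, 6]                   # hypothetical population
n = 3
means = all_sample_means(population, n)     # 4^3 = 64 equally likely samples

print(mean(means), mean(population))              # mean of the means = population mean
print(variance(means), variance(population) / n)  # variance of the means = sigma^2 / n
```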

Note. In practical problems of statistics, we do not know
the mean and S. D. of the whole population. We have to
estimate these values from samples. If the sample is a fairly large
one, we may assume that its S. D. is nearly the same as that of the
whole population.



15.8. Levels of significance. We have already seen that
the probability of a variate in a normal distribution lying outside
Mean ± 3σ is about 0.27 per cent, or roughly once in 370 trials,
which is a very small quantity. Thus if a manufacturer knows from
past experience that a normal worker in the factory produces on the
average 400 pieces per day with standard deviation 10, and by a
new process only 360 pieces are manufactured in a day, we feel
that the new process is highly unsatisfactory, since this value
deviates from the mean by 40 or 4σ. The region such that the
hypothesis is rejected when the sample point falls in it is known as
the critical region or region of rejection. Generally, we take two
critical regions, which cover 5 per cent and 1 per cent areas of the
normal curve. When a hypothesis is rejected because the variate
deviates from the mean by more than 1.96σ on either side, the test
is said to be at the 5 per cent level of significance, and if it is
rejected when the variate lies outside 2.58σ, it is at the 1 per cent
level of significance. The probability of the value
of the variate falling in the critical region is known as the level
of significance. If the level of significance is taken at 5 per cent,
2.5 per cent of the area shall lie on the right hand side of x = 1.96σ and
2.5 per cent on the left hand side of x = -1.96σ (see the figure).
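
The areas cut off by these multiples of σ can be computed directly. The sketch below uses the standard library function `math.erfc`; the function name `two_tail_area` is an assumption introduced for illustration.

```python
import math

def two_tail_area(k):
    """Probability that a normal variate lies outside Mean -/+ k * sigma (both tails together)."""
    return math.erfc(k / math.sqrt(2.0))

for k in (1.96, 2.58, 3.0, 4.0):
    print(k, round(100 * two_tail_area(k), 3), "per cent")
```

The last value, for k = 4, corresponds to the deviation of 40 pieces in the illustration of the manufacturer, where the standard deviation is 10.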



The choice of the level of significance will depend upon the
nature of the problem and is a matter of judgement for those
who carry out the experiment. Their judgement should naturally
be guided by the degree of confidence they have in the null
hypothesis. If they have the firm belief that the null hypothesis
must be true, it will require a very improbable result before they
can reject the hypothesis. On the other hand if they have no
very strong feeling about the validity of the hypothesis, they
may reject it on a less improbable result. For example, suppose
in a coin-tossing experiment the null hypothesis is that the coin
is unbiased. If the experimenter is convinced that the coin is
not biased, the observed result might have to have a probability
of 0.1 per cent, or even less, before he can reject the hypothesis
of no bias.

If the experimenter is an expert at detecting bias in coins,
and feels that a particular coin is biased, he would
reject the hypothesis if the observed result had a probability
of only 5 per cent or even 10 per cent.

Nature of the double-tail and single-tail tests. It generally
depends on the nature of the problem whether we have to use a
double-tail or single-tail test in gauging the significance of a
result. In the two-tail test, we take into consideration the area
of both the tails of the curve represented by the sampling
distribution, whereas in a single-tail test only the area on the right of
an ordinate, say at $x = x_0$, is taken into account. For example,
in a coin-tossing experiment, a double-tail test should be used to
test whether a coin is biased, since a coin will be biased if either
it gives significantly more heads than tails (this gives the right
tail only) or it gives significantly more tails than heads (this
gives the left tail only). Most of the tests are of this two-tail
type. In some cases, however, a single-tail test is desirable.

Underlying assumption. A very important assumption in
carrying out the interpretation of the results is that the sample
averages are distributed normally. Even when the distribution
of individuals is not normal it is nearly always true that averages
and, to a lesser extent, standard deviations of random samples
from the population of individuals are approximately normally
distributed if the sample size is not too small. The proof of the
above statement is beyond the scope of the book.

15.9. Means of the samples. If a number of random
samples of size n is taken, we have seen that

\[ \bar{X} = \mu \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}, \]

where $\bar{X}$ and $\sigma_{\bar{x}}$ are the mean and standard deviation of the
means of the samples. If the population is normally distributed,
we have also seen that the mean of the samples is normally
distributed with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$. If we
consider the variate

\[ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}, \]

where $\bar{x}$ is the mean of a sample, then z is a normal variate with
mean zero and s. d. unity.
On the null hypothesis that the mean of the means of the samples
is equal to $\mu$, the value of z should be zero, but it is not always
so. We have to see whether the difference between the observation
and hypothesis is significant or merely due to fluctuations
of sampling, i.e. whether the value of z falls inside the critical
region or not. As already observed, if |z| > 1.96, the difference
is significant at the 5 per cent level of significance and if |z| > 2.58,
the difference is significant at the 1 per cent level. In some
cases where the rejection of the hypothesis may have serious
implications, the 0.1 per cent level of significance is also
adopted.

15.10. Fiducial or Confidence Limits. Since $z = \dfrac{\bar X - \mu}{\sigma}\sqrt n$
is a standard normal variate, we have

$$P\{\,|z| < 1.96\,\} = 0.95$$

or

$$P\left\{\frac{|\bar X - \mu|}{\sigma}\sqrt n < 1.96\right\} = 0.95$$

or

$$P\left\{\bar X - \frac{1.96\,\sigma}{\sqrt n} < \mu < \bar X + \frac{1.96\,\sigma}{\sqrt n}\right\} = 0.95.$$

This relation shows that in repeated sampling, the probability
that the interval $\left(\bar X - \dfrac{1.96\,\sigma}{\sqrt n},\ \bar X + \dfrac{1.96\,\sigma}{\sqrt n}\right)$ will include $\mu$ is
0.95. This means that if a very large number of samples, each
of size n, is taken from the population and if we determine the
above interval for each sample, then in about 95 per cent of the
cases the interval will include $\mu$, while in the remaining 5 per
cent of cases it will not do so. It is possible that a particular
sample gives limits between which the population mean does
not lie. What the above statement means is that under the same
conditions repeated samples will produce 95 per cent results for
which $\mu$ will lie in the above interval. The values
$\bar X - \dfrac{1.96\,\sigma}{\sqrt n}$ and $\bar X + \dfrac{1.96\,\sigma}{\sqrt n}$ are called 95 per cent fiducial or
confidence limits of $\mu$ corresponding to the observed sample.
The quantity 0.95 is called the confidence coefficient. Similarly
99 per cent confidence limits for $\mu$ are
$\bar X - \dfrac{2.58\,\sigma}{\sqrt n}$ and $\bar X + \dfrac{2.58\,\sigma}{\sqrt n}$.
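
The limits above can be computed directly. The following is a minimal sketch, assuming a known population s. d.; the function name `confidence_limits` and the figures used (mean 100, s. d. 15, n = 225) are hypothetical. The multiplier is taken from the standard library's `statistics.NormalDist` rather than from printed tables, so it is 1.95996... instead of the rounded 1.96.

```python
import math
from statistics import NormalDist

def confidence_limits(sample_mean, sigma, n, level=0.95):
    """Return (lower, upper) confidence limits for the population mean."""
    # Two-sided multiplier: area `level` lies between -z and +z.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical figures: the standard error is 15 / sqrt(225) = 1.
lower, upper = confidence_limits(100, 15, 225)
print(round(lower, 2), round(upper, 2))

lower99, upper99 = confidence_limits(100, 15, 225, level=0.99)
print(round(lower99, 2), round(upper99, 2))
```

Raising the confidence coefficient from 0.95 to 0.99 widens the interval, as the multipliers 1.96 and 2.58 in the text indicate.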

15.11. Solved Examples.

1. A sample of 400 items is taken from a normal population
whose mean is 4 and whose variance is 4. If the sample mean
is 4.45, can the sample be regarded as a truly random sample ?

Mean of the population $\mu = 4$.

S. D. of the population $\sigma = 2$.

S. D. of the mean of the samples $= \dfrac{\sigma}{\sqrt n} = \dfrac{2}{\sqrt{400}} = 0.1$.

Hence

$$z = \frac{\bar X - \mu}{\sigma/\sqrt n} = \frac{4.45 - 4}{0.1} = 4.5.$$

Here the deviation of the mean of the sample from the mean
of the population is 4.5 S. E., which is highly significant.

Therefore the sample cannot be regarded as a random
sample.


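2. The mean of a certain normal population is equal to the
standard error of the mean of the samples of 100 from that
distribution. Find the probability that the mean of the sample of
25 from the distribution will be negative. (Punjab '52)

If the mean of the distribution is $\mu$ and that of the sample
is $\bar X$, then

$$\mu = \sigma_{\bar X} = \frac{\sigma}{\sqrt{100}} = \frac{\sigma}{10},$$

where $\sigma$ is the S. D. of the distribution.
For a sample of size 25, we have

$$z = \frac{\bar X - \mu}{\sigma/\sqrt{25}} = \frac{\bar X - \sigma/10}{\sigma/5} = \frac{5\bar X}{\sigma} - \frac12.$$

Since $\bar X$ is negative, $z < -\tfrac12$.

The probability that z, a normal variate, is less than $-\tfrac12$
is given by

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{-1/2} e^{-z^2/2}\,dz = \frac{1}{\sqrt{2\pi}}\int_{1/2}^{\infty} e^{-z^2/2}\,dz$$

= .3085 (from the tables).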
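
The tabular value can be checked numerically. The sketch below is an illustration only: the function name `prob_sample_mean_negative` is hypothetical, and the population s. d. is set to 1 as an assumption, since the answer does not depend on its value.

```python
import math
from statistics import NormalDist

def prob_sample_mean_negative(mu, sigma, n):
    """P(sample mean < 0) when the sample mean is normal with s.e. sigma/sqrt(n)."""
    standard_error = sigma / math.sqrt(n)
    return NormalDist(mu, standard_error).cdf(0.0)

sigma = 1.0                      # assumed; any positive value gives the same answer
mu = sigma / math.sqrt(100)      # population mean equals the s.e. for samples of 100
p = prob_sample_mean_negative(mu, sigma, 25)
print(round(p, 4))
```

The result agrees with the area to the left of z = -1/2 read from the normal tables.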

3. The guaranteed average life of a certain type of electric
light bulbs is 1,000 hours with a standard deviation of 125 hours.
It is decided to sample the output so as to ensure that 90% of the
bulbs do not fall short of the guaranteed average by more than 2.5%.
What must be the minimum sample size ?

Let n be the size of the sample. Since the guaranteed mean
is 1,000, we do not want the mean of the sample to fall short of
1000 by more than 2.5% of 1000 (i. e. 25), so that it should not be
below 1000 - 25 = 975. Hence $\bar X > 975$. It follows that

$$z = \frac{\bar X - \mu}{\sigma/\sqrt n} > \frac{975 - 1000}{125/\sqrt n} = -\frac{\sqrt n}{5}.$$

From the given condition, the area of the probability normal
curve to the right of $-\dfrac{\sqrt n}{5}$ should be .9, or the area between 0 and
$\dfrac{\sqrt n}{5}$ is .4.

From the table of areas, we get

$\dfrac{\sqrt n}{5} = 1.281$ or n = 41 approx.

∴ The sample should not consist of less than 41 bulbs.

4. A sample of 400 male students is found to have a mean
height of 67.47 inches. Can it be reasonably regarded as a sample
from a large population with mean height 67.39 inches and S. D.
1.30 inches ? [Agra M. Sc. (Maths.) 1961]

Here $\bar X = 67.47$ inches,

$\mu = 67.39$ inches and $\sigma = 1.30$ inches,

n = 400.

Hence

$$z = \frac{\bar X - \mu}{\sigma/\sqrt n} = \frac{67.47 - 67.39}{1.30/20} = 1.23.$$

Here the deviation of the mean of the sample from the mean
of the population is 1.23 S. E., which is not significant.
Therefore the sample can reasonably be regarded as drawn from a
large population with mean height 67.39 inches and S. D. 1.30
inches.

5. It is known that the mean and standard deviation of a
variable are respectively 100 and 10 in the universe. It is however
considered sufficient to draw a sample of sufficient size but such as to
ensure that the mean of the sample would be in all probability
within 0.01% of the true value. How much would be the cost
(exclusive of overhead charges) if the charges for drawing 100
members of a sample be one rupee ? Find the extra cost necessary
to double the precision. (I. A. S. '47)

Assuming the conditions of simple sampling, the sample mean
should not differ from the true mean by more than 0.01%, i. e. by
0.01, since the true mean here is 100.

The S. E. of the mean of the sample $= \dfrac{\sigma}{\sqrt n} = \dfrac{10}{\sqrt n}$, if the sample
size is n and $\sigma$ the S. D. of the universe.

We know that in a normal distribution $\mu \pm 3\sigma$ contains almost
all the values of the variate, $\mu$ being the mean of the distribution.
Hence

$$\frac{.01}{10/\sqrt n} \text{ should be equal to } 3,$$

or $\sqrt n = 3{,}000$ i. e. n = 9,000,000.

The sample size is therefore 9,000,000. The sampling charges
are thus Rs. 90,000.

To double the precision, we should have $\dfrac{.005}{10/\sqrt n} = 3$.

This gives n = 36,000,000.

Hence extra cost = 360,000 - 90,000 = 270,000 rupees.
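
The arithmetic of this example can be retraced as follows. This is a sketch under the same three-standard-error rule the solution uses; the function names `sample_size_three_sigma` and `sampling_cost` are hypothetical.

```python
def sample_size_three_sigma(sigma, tolerance):
    """n for which three standard errors, 3 * sigma / sqrt(n), equal the tolerance."""
    return round((3 * sigma / tolerance) ** 2)

def sampling_cost(n, rupees_per_100=1):
    """Cost in rupees when drawing 100 members costs `rupees_per_100`."""
    return n * rupees_per_100 / 100

n_first = sample_size_three_sigma(10, 0.01)     # within 0.01 of the mean 100
n_double = sample_size_three_sigma(10, 0.005)   # precision doubled
extra_cost = sampling_cost(n_double) - sampling_cost(n_first)
print(n_first, n_double, extra_cost)
```

Halving the tolerance multiplies the sample size, and hence the cost, by four, which is why the extra cost is three times the original charge.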

6. Given that for a universe $\mu = 66$, $\sigma = 5.5$, what sample size
n must be used in order that, for similar test conditions, the
probability that the average value of the sample will be in error by not
more than 5% of the average of the universe shall be ?

From the given condition, the deviation of $\bar X$ (sample mean)
from $\mu = 66$ is 5% of $\mu$,

i. e. $\bar X - \mu = 5\%$ of $66 = 3.3$.

Hence

$$z = \frac{3.3}{5.5/\sqrt n} = \tfrac35\sqrt n.$$

Now it is given that the area of the probability normal
curve to the right of $\tfrac35\sqrt n$ should be 5% i. e. .05, so that the
area between 0 and $\tfrac35\sqrt n$ is .5 - .05 = .45.

From the tables, we find that z = 1.645. Hence $\tfrac35\sqrt n = 1.645$
i. e. $n = \left(\tfrac53 \times 1.645\right)^2 = 8$ approx.

7. To know the mean weight of all 10-year-old boys in the
state of Rajasthan, a sample of 225 is taken. The mean weight of
this sample is found to be 67 pounds with a standard deviation of 12
pounds. Can you draw any inference from it about the mean weight
of the universe ?

Here the S. D. of the universe is not given, but we can use in its
place the S. D. of the sample, which is given to be 12.

Standard error of the mean $= \dfrac{\sigma}{\sqrt n} = \dfrac{12}{\sqrt{225}} = .8$ pound.

Assuming simple sampling conditions, the mean weight of
the universe would in all probability lie within three times this
S. E. of the mean of the sample. Hence the mean weight of all
10-year-old boys in the state lies between

67 lbs. ± 2.4 lbs., i. e. 64.6 lbs. and 69.4 lbs.

8. An industry desires to make a survey of the mean weekly
wages of 10,000 of its workers. Since a study of all the workers is
impossible, a representative sample of 400 workers is selected; by
how much would the results differ from the above sample ?

Taking, as the working below does, the mean weekly wage of the
sample as Rs. 30 and its standard deviation as Rs. 2.5,

Standard error of the mean $= \dfrac{\sigma}{\sqrt n} = \dfrac{2.5}{\sqrt{400}} = .125$ rupee.

If fresh samples were taken, their means would not differ from
the mean weekly wage of this sample by more than three times
the S. E. of the mean of the sample, that is, the mean weekly
wages of all the fresh samples would lie within Rs. 30 ± .375, or
between Rs. 29.625 and Rs. 30.375.


9. Suppose that the standard deviation of stature in men is
2.48 inches. One hundred male students in a large university are
measured and their average height is found to be 68.52 inches.
Determine the 98 per cent confidence limits for the mean height of the
men of the university.

Standard error of the mean height of 100 male students

$$= \frac{\sigma}{\sqrt n} = \frac{2.48}{\sqrt{100}} = .248 \text{ inches.}$$

Now 98% confidence limits for the mean height of the men
of the university means that 49% of the total area under the
normal curve lies on each side of the mean, i. e. .5 + .49 = .99 of
the area under the standard normal curve should lie to the left of the
critical value of the variate z. From the table, we find that this
critical value of z is 2.32.

Hence for the required confidence limits, we have

$$\frac{|\mu - 68.52|}{.248} < 2.32,$$

i. e. $68.52 - 2.32 \times .248 < \mu < 68.52 + 2.32 \times .248$

or $67.945 < \mu < 69.095$.

Hence the 98% confidence limits for the mean height of the
men of the university are 67.945 inches and 69.095 inches.

10. The data concerning height measurement for a random
sample of individuals from a given population are as follows :

mean = 172, S. D. = 12, n = 65.

If a large number of samples of the same size were selected at
random from the given population, what would be the limits of 2%
confidence interval for the true mean ?

S. E. of the mean $= \dfrac{\sigma}{\sqrt n} = \dfrac{12}{\sqrt{65}} = 1.5$ nearly.

Now the limits of 2% confidence interval for the true mean
means the same thing as 98% confidence limits for the true mean.
Hence as in the previous example, we have the required
confidence limits for the mean as

172 ± 1.5 × 2.32 = 172 ± 3.48,
i. e. the limits are 168.52 and 175.48.

11. A research worker wishes to estimate the mean of a
population using a sufficiently large sample. The probability is
95 per cent that the sample mean will not differ from the true mean
by more than 25 per cent of the standard deviation. How large a
sample should be taken ? (Agra B. Sc. 1960)

We know that the value of the standard normal variate z for
which the area between -z and z is .95 is 1.96. Hence, we have

$$\frac{|\bar X - \mu|}{\sigma/\sqrt n} < 1.96$$

or

$$|\bar X - \mu| < \frac{1.96\,\sigma}{\sqrt n}. \qquad ...(1)$$

Also it is given that

$$|\bar X - \mu| < \frac{\sigma}{4}. \qquad ...(2)$$

From (1) and (2), we get

$$\frac{1.96\,\sigma}{\sqrt n} < \frac{\sigma}{4}$$

or $n > 16 \times (1.96)^2 = 62$ nearly.
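
The same calculation can be written as a small routine. This is a sketch only: the function name `sample_size_for_margin` is hypothetical, the margin is expressed as a fraction of the population s. d., and the multiplier is computed by `statistics.NormalDist` instead of being read from tables.

```python
import math
from statistics import NormalDist

def sample_size_for_margin(confidence, margin_in_sigmas):
    """Smallest n with z * sigma / sqrt(n) <= margin_in_sigmas * sigma."""
    # Two-sided multiplier, e.g. about 1.96 for confidence 0.95.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z / margin_in_sigmas) ** 2)

n = sample_size_for_margin(0.95, 0.25)
print(n)
```

Because the sample size must be a whole number, the routine rounds upward; a tighter margin or a higher confidence coefficient both increase the answer.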

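12. A normal population has a mean of 0.1 and a S. D. of
2.1. Find the probability that the mean of a simple sample of 900
members will be negative.

Here S. E. of the mean $= \dfrac{\sigma}{\sqrt n} = \dfrac{2.1}{30} = .07$.

$$\therefore\quad z = \frac{\bar X - \mu}{\sigma/\sqrt n} = \frac{\bar X - 0.1}{.07} = \frac{\bar X}{.07} - 1.43.$$

Since $\bar X$ is negative, z < -1.43.

The probability that $\bar X$ is negative, i. e. z < -1.43, is given by
the area of the standard normal curve to the left of z = -1.43,

= .0764 from the tables.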

13. The mean height of 9339 children of age 5 years is
41.26 inches, the standard deviation is 2.238 inches. Find the odds
against the possibility that the mean of a random sample of 100 is
greater than 41.70.

S. E. of the mean $= \dfrac{\sigma}{\sqrt n} = \dfrac{2.238}{10} = 0.2238$.

$$\therefore\quad z = \frac{\bar X - \mu}{\sigma/\sqrt n} = \frac{41.70 - 41.26}{.2238} = 1.96.$$

From the table we see that for z = 1.96, the area to the left is
0.9750 and so the area to the right of z = 1.96 is 1 - 0.9750 = 0.0250.
Hence the probability that the mean of a random sample of 100 is
greater than 41.70 is 0.0250 i. e. $\tfrac{1}{40}$, or the odds against are
as 39 to 1.


14. Suppose light bulbs made by a standard process have an
average life of 2000 hours with standard deviation of 250 hours.
And suppose it is considered worth while to replace the process if the
mean life can be increased by at least 10 per cent. An engineer
wishes to test a proposed new process, and he is willing to assume
that the standard deviation of the distribution of bulbs is about the
same as for the standard process. How large a sample should he
examine if he wishes the probability to be about .01 that he will fail
to adopt the new process if in fact it produces bulbs with a mean life
of 2250 hours ?

Since there is to be an increase of 10 per cent in the mean of
the standard process, the mean of the new process

$= 2000 \times \dfrac{110}{100}$ hours = 2200 hours $= \mu$.

And S. D. of the new process = S. D. of the standard process

= 250 hours $= \sigma$.

Mean of the sample = 2250 hours $= \bar X$.
If n be the number in the sample, then

$$z = \frac{\bar X - \mu}{\sigma/\sqrt n} = \frac{2250 - 2200}{250/\sqrt n} = \frac{\sqrt n}{5}.$$

Since the probability is to be about .01 that he will fail to
adopt the new process, the corresponding value of z for this
probability is 2.58.

Hence

$$\frac{\sqrt n}{5} = 2.58$$

or $n = 25 \times (2.58)^2 = 166$ approx.

15. An unbiased coin is thrown n times. It is desired that the
relative frequency of the appearance of heads should lie between
.49 and .51. Find the smallest value of n that will ensure this
result with 90% confidence.

Now 90% confidence means that .45 of the total area under
the standard normal curve should lie on each side of the mean.
From the tables, the corresponding value of z is 1.645.

Also standard error of the proportion of heads

$$= \sqrt{\frac{pq}{n}} = \sqrt{\frac{\tfrac12 \cdot \tfrac12}{n}} = \frac{1}{2\sqrt n}.$$

Hence

$$.5 - 1.645 \times \frac{1}{2\sqrt n} = .49$$

and

$$.5 + 1.645 \times \frac{1}{2\sqrt n} = .51.$$

These give

$$\frac{1.645}{2\sqrt n} = .01$$

or

$$\sqrt n = \frac{1.645}{.02} = \frac{329}{4}$$

or

$$n = \frac{(329)^2}{16} = 6765 \text{ approx.}$$
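
The figure can be checked with a short computation. This is a sketch assuming the tabular value 1.645 used in the text; the function name `tosses_needed` is hypothetical and it returns the unrounded value of n.

```python
import math

def tosses_needed(z, half_width, p=0.5):
    """Unrounded n for which z * sqrt(p(1-p)/n) equals half_width."""
    return (z * math.sqrt(p * (1 - p)) / half_width) ** 2

# Relative frequency of heads to lie within .01 of .5, with z = 1.645.
n = tosses_needed(1.645, 0.01)
print(n)
```

Since the required n grows as the inverse square of the half-width, narrowing the band from ±.01 to ±.005 would quadruple the number of throws.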


16. Mean of 10 readings on the length of a given rod is 20 inches.
The standard deviation of errors of measurement is known to be
0.1 inch. Does the result contradict the assumption that the length
of the rod is 19.9 inches ? [Agra M. Sc. (Maths.) 1962]

S. E. of the mean $= \dfrac{0.1}{\sqrt{10}}$.

$$\therefore\quad z = \frac{20 - 19.9}{0.1/\sqrt{10}} = \sqrt{10} > 3.$$

Hence the difference is significant and so the length of the
rod is not 19.9 inches.


17. If the mean breaking strength of copper wire is 575 lbs.
with a standard deviation of 8.3 lbs., how large a sample must be
used in order that there be one chance in 100 that the mean breaking
strength of the sample is less than 572 lbs. ?

Here

$$z = \frac{\bar X - \mu}{\sigma/\sqrt n} = \frac{572 - 575}{8.3/\sqrt n} = -\frac{3\sqrt n}{8.3}$$

or

$$|z| = \frac{3\sqrt n}{8.3}. \qquad ...(1)$$

Now the probability that $\bar X < 572$ is $\tfrac{1}{100} = .01$. Hence we
have to seek that value of z for which

0.01 = area to the right of the variate z,
so that area to the left = 0.99. From the tables of areas of
the normal curve, we find that the corresponding value of z is
2.33. Hence we get from (1),

$$2.33 = \frac{3\sqrt n}{8.3},$$

which gives n = 42 nearly.


Exercises.

1. A simple sample of 1000 members is found to have a mean
3.42 cm. Could it be reasonably regarded as a simple sample
from a large population whose mean is 3.3 cm. and standard
deviation 2.6 cm. ? Ans. Yes, since $z = \dfrac{.12}{.082} < 2$.

2. A sample of 900 members is found to have a mean of
3.4 cm. Can it be reasonably regarded as a simple sample
from a large population with mean 3.2 cm. and S. D.
2.3 cm. ? Ans. No, at 5% level of significance, since z = 2.6.

3. Suppose that it has been determined that the average pulse
rate of males in the 20-25 year age-group is 72 beats per
minute and that the standard deviation is 9.5 beats per
minute. If a group of 55 distance runners, all in the given
age-group, were examined and found to have an average pulse
rate of 65, should this be regarded as a significant deviation
from the general average ? Ans. Yes ; z = 5.47.

4. The average of 400 cases is 30 and the standard deviation
is 16.

(a) Find the standard error of the mean and the probable
error of the mean. Find also the probability that the average
of the population from which the sample is drawn (b) is
greater than 32, (c) is less than 27.5, (d) lies between 29
and 31, and (e) does not differ from 30 by more than 2.

Ans. (a) .8, .54 ; (b) .0062 ; (c) .001 ; (d) .789 ;
(e) .988.

5. If p is the observed proportion of success in n independent
Bernoullian trials, prove that the 99% fiducial limits for the
proportion p' for large samples are the roots of the quadratic

$(p' - p'^2)\,(2.58)^2 = n\,(p' - p)^2$. (Patna M. A. 1956)

6. The grades of students in a certain course averaged 77 over
a period of years. A class of 40 has a mean grade of 70
with a standard deviation of 9. Can this lower mean
be attributed to ordinary sampling variation ?

Ans. No ; z = 4.93.

7. The numerical grades of graduates of a large college have a
mean of 2.83 with a standard deviation of 0.538. If the
mean grade of a group of 36 graduates who majored in
history is found to be 2.97, should this group be considered
different from the general run of graduates ?

Ans. No ; z = 1.56.

15.12. Test of significance of the means of two large samples.
Suppose from a normal population, with standard deviation
$\sigma$, two simple samples, one of size $n_1$ and mean $\bar x$ and the
other of size $n_2$ and mean $\bar y$, are drawn. We wish to test whether
the difference between $\bar x$ and $\bar y$, the means of the samples, is
significant or merely due to fluctuations of sampling. We know that the
variance of the difference of the means of two samples is
$\sigma^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)$, where $\dfrac{\sigma}{\sqrt{n_1}}$ and $\dfrac{\sigma}{\sqrt{n_2}}$ are the standard deviations
of the means of the samples. Hence

$$z = \frac{\bar x - \bar y}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

is normally distributed with mean zero and standard deviation
unity. The probability that | z | > 1.96 is .05, and hence the
difference between $\bar x$ and $\bar y$ is not significant at the 5% level of
significance if | z | < 1.96. If | z | > 3, it is
highly probable that either the samples have not been taken from
the same population or the sampling is not simple.

If the samples are known to be taken from two normal
populations with means $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$,
then $\bar x - \bar y$ is normally distributed with mean $\mu_1 - \mu_2$ and standard
deviation $\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$. We get

$$z = \frac{(\bar x - \bar y) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}.$$

On the hypothesis that $\mu_1 = \mu_2$, we have

$$z = \frac{\bar x - \bar y}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

and the same procedure of test of significance is applied.
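
Both forms of the test reduce to one formula, since the common-$\sigma$ case is obtained by putting $\sigma_1 = \sigma_2 = \sigma$. The sketch below illustrates this; the function name `z_two_means` is hypothetical, and the figures are those of the Gorakhpur villages example (Solved Example 1 of 15.13).

```python
import math

def z_two_means(x_bar, y_bar, sd1, sd2, n1, n2):
    """z for the difference of two sample means on the hypothesis mu1 = mu2."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (x_bar - y_bar) / standard_error

# Two samples of 200 villages: means 485 and 510, s.d. 50 and 40.
z = z_two_means(485, 510, 50, 40, 200, 200)
print(round(z, 2), abs(z) > 3)
```

The sign of z only records which sample has the larger mean; the significance is judged from | z |.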


15.13. Solved Examples.

1. A random sample of 200 villages was taken from Gorakhpur
district and the average population per village was found to be 485
with a standard deviation of 50. Another random sample of 200
villages from the same district gave an average population of 510 per
village with a standard deviation of 40. Is the difference between
the averages of the two samples statistically significant ? Give
reasons. (P. C. S. '49)

On the hypothesis that the samples have been taken from the
same population, we put

$$z = \frac{\bar x - \bar y}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}.$$

Now $\bar x = 485$, $\sigma_1 = 50$, $n_1 = 200$,
$\bar y = 510$, $\sigma_2 = 40$, $n_2 = 200$,

so that

$$z = \frac{485 - 510}{\sqrt{\dfrac{(50)^2}{200} + \dfrac{(40)^2}{200}}} = \frac{-25}{4.53},$$

i. e. | z | = 5.5 nearly > 3.

Hence the difference between the means of the samples is
highly significant and could not have arisen from causes due to
fluctuations of sampling.


2. The means of simple samples of 1000 and 2000 are 67.5
and 68.0 inches respectively. Can the samples be regarded as drawn
from the same population of standard deviation 2.5 inches ?

(M. Sc. Agra '52, '63)

Here $\bar x = 67.5$, $n_1 = 1000$,

$\bar y = 68.0$, $n_2 = 2000$.

The s. d. of the population $\sigma = 2.5$.

On the hypothesis that the samples are drawn from the same
population of s. d. 2.5 inches, we get

$$z = \frac{\bar x - \bar y}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{67.5 - 68.0}{2.5\sqrt{\dfrac{1}{1000} + \dfrac{1}{2000}}} = \frac{-.5}{2.5 \times .0387} = \frac{-.5}{.09675},$$

i. e. | z | = 5.2 nearly.

Hence the difference between the means is more than three
times the standard error of the difference and so is statistically
significant. The samples cannot be regarded as drawn from the
same population of s. d. 2.5 inches.

3. A potential buyer of light bulbs bought 50 bulbs each of two
brands. Upon testing these bulbs, he found that brand A had a mean
life of 1282 hours with a s. d. of 80 hours whereas B had a mean
life of 1208 hours with a s. d. of 94 hours. Can the buyer be quite
certain that the two brands do differ in quality ? (B. A. Punjab '61)

$\bar x = 1282$, $n_1 = 50$, $\sigma_1 = 80$ ;
$\bar y = 1208$, $n_2 = 50$, $\sigma_2 = 94$.

On the hypothesis that the two brands do not differ in quality,

$$z = \frac{1282 - 1208}{\sqrt{\dfrac{(80)^2}{50} + \dfrac{(94)^2}{50}}} = \frac{74}{\sqrt{\dfrac{15236}{50}}} = \frac{74}{17.456} = 4.24 \text{ nearly.}$$

Hence the difference between the means is 4.24 times the
standard error of the difference of the means. As such the
difference is significant and brand A is superior to brand B.

4. Two hundred valves of each of the two makes of wireless
valves are tested. It is found that the first make has a mean life
3782 hours with a standard deviation of 80 hours while the second
has a mean life 3745 hours with a standard deviation of 94 hours.
Is there a reason for selecting any particular make ? Explain
briefly your test procedure. [You are given that if X is normal with
mean $\mu$ and variance $\sigma^2$, then $P\{|X - \mu| > 1.96\,\sigma\} = 0.05$.]

Here $\bar x = 3782$, $n_1 = 200$, $\sigma_1 = 80$,
$\bar y = 3745$, $n_2 = 200$, $\sigma_2 = 94$.

$$z = \frac{3782 - 3745}{\sqrt{\dfrac{(80)^2}{200} + \dfrac{(94)^2}{200}}} = \frac{37}{\sqrt{\dfrac{15236}{200}}} = \frac{37}{8.728} = 4.24.$$

Since z > 1.96, the difference is significant at 5% level. Hence
valves of the first make are superior to those of the second.


5. A sample of heights of 6400 soldiers has a mean of
67.85 inches and a standard deviation of 2.56 inches while a simple
sample of heights of 1600 sailors has a mean of 68.55 inches and
a standard deviation of 2.52 inches. Do the data indicate that the
sailors are on the average taller than soldiers ? (Agra B. Sc. '55)

Here $\bar x = 67.85$, $\sigma_1 = 2.56$, $n_1 = 6400$,

$\bar y = 68.55$, $\sigma_2 = 2.52$, $n_2 = 1600$.

The standard error of the difference of the mean heights is
given by

$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{(2.56)^2}{6400} + \frac{(2.52)^2}{1600}} = \sqrt{(.001024) + (.003969)} = .07 \text{ nearly.}$$

The difference between the means $= \bar y - \bar x = .7$ inches,
which is nearly ten times the standard error of the difference of the
means, and this is highly significant. Hence the data indicate that
the sailors with greater mean height are on the average taller than
the soldiers.


6. Intelligence tests on two groups of boys and girls give the
following results. Examine if the difference is significant :

Girls : Mean 84, S. D. 10, No. 121.
Boys : Mean 81, S. D. 12, No. 81.

(P. C. S. 1943)

The standard error of the difference of the means is
given by

$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{100}{121} + \frac{144}{81}} = \sqrt{2.6042} = 1.61.$$

$$\therefore\quad z = \frac{84 - 81}{1.61} = 1.86 < 1.96.$$

Hence the difference is not significant.

7. A random sample of 1000 men from South India shows
their mean wage to be Rs. 47 a week with a standard deviation
of Rs. 28. A random sample of 1500 men from Punjab gives a
mean wage of Rs. 49 a week with a standard deviation of Rs. 40.
Is there any significant difference between their mean level of wages ?

(Raj. 1959)

Here $\bar x = 47$ ; $n_1 = 1000$ ; $\sigma_1 = 28$,

$\bar y = 49$ ; $n_2 = 1500$ ; $\sigma_2 = 40$.

$$z = \frac{\bar x - \bar y}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{47 - 49}{\sqrt{\dfrac{(28)^2}{1000} + \dfrac{(40)^2}{1500}}} = \frac{-2}{\sqrt{1.85066}} = \frac{-2}{1.36},$$

i. e. | z | = 1.47 < 1.96.

Hence there is no significant difference between the mean
level of wages.

8. The mean stature and standard deviation of 1,145 boys of
age 9 living in one-roomed houses in Glasgow are 48.60 and
2.416 inches ; the mean and standard deviation of 654 children of
the same age living in four-roomed houses are 50.79 and 2.53 inches.
Find whether the difference between the means is significant.

Here $\bar x = 48.60$, $n_1 = 1{,}145$, $\sigma_1 = 2.416$,

$\bar y = 50.79$, $n_2 = 654$, $\sigma_2 = 2.53$.

$$z = \frac{\bar x - \bar y}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{48.60 - 50.79}{\sqrt{\dfrac{(2.416)^2}{1145} + \dfrac{(2.53)^2}{654}}} = -\frac{2.19}{.122}$$

or | z | = 18 nearly.

Hence the difference is highly significant.

9. Two populations have the same mean, but the standard
deviation of one is twice that of the other. Show that in samples
of 500 from each drawn under simple random conditions, the
difference of the means will in all probability not exceed $0.3\,\sigma$, where
$\sigma$ is the smaller standard deviation ; and assuming the distribution of
the difference of the means to be normal, find the probability that it
exceeds half that amount.

Let $\sigma$ and $2\sigma$ be the standard deviations of the two
populations. Then the standard error of the difference of the means
of two samples is given by

$$E = \sqrt{\frac{\sigma^2}{500} + \frac{4\sigma^2}{500}} = \frac{\sigma}{10}.$$

Under simple sampling conditions, we should have

$$|\bar x - \bar y| < 3E,$$

where $\bar x$, $\bar y$ are the means of the two samples,
or $|\bar x - \bar y| < 0.3\,\sigma$,

i. e. the difference of the means will in all probability not exceed
$0.3\,\sigma$.

If the distribution of the difference of the means is normal,
a difference of $0.15\,\sigma$ corresponds to $z = 0.15\,\sigma \div \dfrac{\sigma}{10} = 1.5$,
where z is the standard normal variate. We find from the table that
the area to the right of z = 1.5 under the standard normal curve
is .0668, and the difference may exceed $0.15\,\sigma$ in either direction.

Hence the required probability is 2 × .0668 = .13 approximately.

10. A random sample of 1,000 farms in a certain year gives
an average yield of wheat of 2,000 lbs. per acre, with a standard
deviation of 192 lbs. A random sample of 1,000 farms in the
following year gives an average yield of 2,100 lbs. per acre, with a
standard deviation of 224 lbs. Show that these data are inconsistent
with the hypothesis that the average yields in the country as a whole
were the same in the two years.

Supposing that the two samples have been drawn independently
of each other, the S. E. of the difference of their mean
yields is given by

$$E = \sqrt{\frac{(192)^2}{1000} + \frac{(224)^2}{1000}} = \sqrt{87.04} = 9.33.$$

$$\therefore\quad |z| = \frac{|\bar x - \bar y|}{E} = \frac{|2000 - 2100|}{9.33} = \frac{100}{9.33} > 10.$$

Thus the difference between the means would not have
arisen due to fluctuations of simple sampling. Hence the data
are inconsistent with the hypothesis that the average yields in the
country as a whole were the same in the two years.

11. Suppose that 64 senior girls from college A and 81 senior
girls from college B had mean statures of 68.2 inches and 67.3
inches respectively. If the standard deviation for statures of all
senior girls is 2.43 inches, is the difference between the two groups
significant ?

Here $\bar x = 68.2$ in., $n_1 = 64$,

$\bar y = 67.3$ in., $n_2 = 81$,

$\sigma = 2.43$.

S. E. of the difference between the means

$$= \sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = 2.43\sqrt{\frac{1}{64} + \frac{1}{81}} = 0.407.$$

Now we set up the null hypothesis that there is no difference
between the population means, i. e. $\mu_1 - \mu_2 = 0$.

$$\therefore\quad z = \frac{\bar x - \bar y}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{68.2 - 67.3}{0.407} = 2.21.$$

Since the calculated value of z is greater than 1.96, our null
hypothesis is contradicted on the 5 per cent level of significance.

It follows that the senior girls from the two colleges differ in
their mean heights.


Exercises

1. From the data given below, find out whether there is a real
difference in favour of the soldiers of Scotch extraction in
the matter of weight, or is the difference so slight that it may
be attributable to chance :

                                  Number   Mean weight   S. D.
Soldiers of Scotch extraction      1821    144.93 lbs.   17.41 lbs.
Soldiers of French extraction       746    142.16 lbs.   16.04 lbs.

[Ans. z > 3. Hence the difference in weight is a
real one in favour of the soldiers of Scotch extraction.]

2. A random sample of 1000 men from Northern India shows
their mean wage to be Rs. 2/8 per day with a standard
deviation of Rs. 1/8. A sample of 1500 men from Southern
India gives a mean wage of Rs. 2/11 per day with a standard
deviation of Rs. 2. Discuss the suggestion that the rate of
wages varies as between the two regions.

[Ans. z < 3. The suggestion that the mean rate of
wages varies as between Northern and Southern India is
incorrect.]

3. If 60 new entrants in a given university are found to have a
mean height of 68.60 inches and 50 seniors a mean height of
69.51 inches, is the evidence conclusive that the mean height
of the seniors is greater than that of the new entrants ?
Assume the standard deviation of the height to be 2.48 inches.

[Ans. $z = \dfrac{.91}{.47} < 2$. It cannot be said that the mean height
of the seniors is greater than that of the new entrants.]

4. Given $n_1 = 200$, mean $m_1 = 50$, $\sigma_1 = 4$ ; find $\sigma_{m_1}$.
Given $n_2 = 300$, mean $m_2 = 51$, $\sigma_2 = 4.2$ ; find $\sigma_{m_2}$.
Find also the standard error of the difference between these
means, and hence the probability that the difference might
occur between random samples from the same population.

[Ans. $\sigma_{m_1} = .2828$, $\sigma_{m_2} = .24249$. S. E. of the difference of
$m_1$ and $m_2$ = .372, P = .0076.]

5. In a wheat variety test conducted over a wide area, the mean
difference between two varieties was found to be 5.5 bushels
to the acre. The standard error of this difference was 1.4
bushels per acre, and was determined from 100 pairs of plots.
Set up the fiducial limits at the 5% probability level for the
mean difference in yield between the two varieties.

[Ans. (5.5 ± 1.96 × 1.4) bushels, i. e. 2.756 bushels and 8.244
bushels.]

6. A random sample of 1000 farms in 1949-50 in Punjab
gives an average yield of wheat of 800 lbs. per acre with a
standard deviation of 62 lbs. Another random sample of
1000 farms in the same year gives an average yield of 760 lbs.
with a standard deviation of 50 lbs. Is the difference between
average yields significant ?

[Ans. $z = \dfrac{40}{2.518} = 16$ approx. Hence the difference is significant.]

7. What is meant by standard error ? Explain under what
conditions it is useful in tests of significance.

8. In two samples of 100 individuals each, the following means
and standard deviations of heights were obtained.

             Mean    Standard Deviation
Sample 1 :   66"     6"
Sample 2 :   69"     5"

Examine whether the observed difference in means is
significant. (Nagpur B. Sc. (Pass), 1956)

[Ans. Yes ; z > 3.]

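15.14. Standard Errors of Other Parameters. Below we
give without proof the standard errors of various constants when
the parent universe is assumed normal.

Parameters                                 Standard Errors

(i) Quartiles                              $1.36263\,\dfrac{\sigma}{\sqrt n}$

(ii) Median                                $1.25331\,\dfrac{\sigma}{\sqrt n}$

(iii) Standard Deviation                   $\dfrac{\sigma}{\sqrt{2n}}$

(iv) Variance                              $\sigma^2\sqrt{\dfrac{2}{n}}$

(v) Coefficient of Correlation             $\dfrac{1 - r^2}{\sqrt n}$

(vi) $\mu_3$                               $\sigma^3\sqrt{\dfrac{6}{n}}$

(vii) $\mu_4$                              $\sigma^4\sqrt{\dfrac{96}{n}}$

(viii) Coefficient of Variation V          $\dfrac{V}{\sqrt{2n}}\sqrt{1 + \dfrac{2V^2}{10^4}} = \dfrac{V}{\sqrt{2n}}$ approx.

(ix) Difference of the two means           $\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2} - 2r\,\dfrac{\sigma_1}{\sqrt{n_1}}\cdot\dfrac{\sigma_2}{\sqrt{n_2}}}$
     of correlated series

Similarly the standard error of the difference of the standard
deviations of two correlated series is

$$\sqrt{\frac{\sigma_1^2}{2n_1} + \frac{\sigma_2^2}{2n_2} - 2r^2\,\frac{\sigma_1\sigma_2}{2\sqrt{n_1 n_2}}}.$$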

15.15. Solved Examples.

Ex. 1. Children who were suffering from the after-effects of
encephalitis lethargica were tested twice ; at their first test their mean
intelligence ratio was 84.60, the standard error of the mean being
1.228 ; at their second test the mean was 73.82 and the standard error
of the mean 1.361. There was a correlation of .764 between the
results of the two sets of tests. Find whether the difference between
the means was significant.

The S. E. of the difference of the means

$$= \sqrt{(1.228)^2 + (1.361)^2 - 2\,(.764)\,(1.228)\,(1.361)} = .898.$$

The difference between the means is 10.78, which is
$\dfrac{10.78}{.898}$ or about 12 times the standard error. Hence the difference
between the means is highly significant. The odds against the
difference being due to chance are therefore overwhelming.
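
The standard error used in this example follows formula (ix) of 15.14 with the two standard errors of the means already given. A minimal sketch is shown below; the function name `se_diff_correlated_means` is hypothetical.

```python
import math

def se_diff_correlated_means(se1, se2, r):
    """S.E. of the difference of two correlated means, given each mean's S.E."""
    return math.sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)

se = se_diff_correlated_means(1.228, 1.361, 0.764)
ratio = (84.60 - 73.82) / se   # difference of means in units of its S.E.
print(round(se, 3), round(ratio, 1))
```

A positive correlation reduces the standard error of the difference; with r = 0 the formula falls back to the independent-samples case.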

Ex. 2. From the data given below, find out whether the difference
between the standard deviations of the two samples is a real one.

$n_1 = 1{,}392$ ; $\sigma_1 = 53.84$ ; $n_2 = 630$ ; $\sigma_2 = 56.56$.

Here $\sigma_2 - \sigma_1 = 2.72$.

And the S. E. of $\sigma_2 - \sigma_1$ is given by

$$E = \sqrt{\frac{\sigma_1^2}{2n_1} + \frac{\sigma_2^2}{2n_2}} = \sqrt{\frac{(53.84)^2}{2\,(1392)} + \frac{(56.56)^2}{2\,(630)}} = 1.89 \text{ nearly,}$$

so that $\dfrac{\sigma_2 - \sigma_1}{E} = \dfrac{2.72}{1.89} = 1.44$.

Hence the difference is not significant. From the tables, we
find that the area to the left of z = 1.44 is .9251. Hence the chance
of reaching or exceeding this value in random samples of the same
population is 2 (1 - .9251) = .15 nearly.
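
The figures of this example can be verified as follows. This is a sketch for independent samples only; the function name `se_diff_sd` is hypothetical, and the tail area is computed with `statistics.NormalDist` rather than read from tables, so the last decimal may differ slightly from the tabular value.

```python
import math
from statistics import NormalDist

def se_diff_sd(sd1, n1, sd2, n2):
    """S.E. of the difference of two independent sample standard deviations."""
    return math.sqrt(sd1 ** 2 / (2 * n1) + sd2 ** 2 / (2 * n2))

se = se_diff_sd(53.84, 1392, 56.56, 630)
z = (56.56 - 53.84) / se
# Chance of a deviation at least this large in either direction.
p_two_sided = 2 * (1 - NormalDist().cdf(z))
print(round(se, 3), round(z, 2), round(p_two_sided, 2))
```

Since the chance is far above .05, the observed difference of the standard deviations is readily attributable to sampling fluctuations.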

Ex. 3. In an intelligence test administered to 60 fathers and
their 100 children, the following results were obtained.

            Mean Score    S. D.

Fathers        114          13

Sons           110          11

Assuming the r between the two to be +.75, calculate the standard
error of the difference of the two means and state whether the
difference is significant.

S. E. of the difference of the means of the correlated samples is
given by

$$E = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} - 2r\,\frac{\sigma_1}{\sqrt{n_1}}\cdot\frac{\sigma_2}{\sqrt{n_2}}}
= \sqrt{\frac{(13)^2}{60} + \frac{(11)^2}{100} - 2 \times .75 \times \frac{13}{\sqrt{60}}\cdot\frac{11}{\sqrt{100}}}$$

$$= \sqrt{1.25} = 1.12.$$

$$\therefore\quad z = \frac{\bar x - \bar y}{E} = \frac{114 - 110}{1.12} = \frac{4}{1.12} > 3.$$

Hence the difference is significant.

Exercises

1. Two samples of 100 and 80 students are taken with a view to
find out their average monthly expenditure. It is found out
that the median monthly expenditure for the first group is
Rs. 85 and for the second group is Rs. 100. The standard
deviation for the first group is Rs. 7 and for the second Rs. 8.
Examine if the median monthly expenditures of the two
samples are statistically different.

[Ans. Yes ; z > 10.

Hint. Standard error of the difference of medians
$= 1.25331\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$.]

2. In a sample of 1000, the mean is found to be 17.5 and the
standard deviation 2.5. In another sample of 800, the mean
is 18 and the standard deviation 2.7. Assuming that the
samples are independent, discuss whether the two samples
may have come from universes which have the same standard
deviations.

[Ans. Yes ; $z = \dfrac{.2}{.088} < 3$.

Hint. Standard error of the difference of the standard
deviations of the two samples

$$= \sqrt{\frac{\sigma_1^2}{2n_1} + \frac{\sigma_2^2}{2n_2}} = 0.088$$

and difference of standard deviations = 0.2,

so that $z = \dfrac{0.2}{0.088} < 3$.]


CHAPTER XVI 


^-DISTRIBUTION 

16.1. Introduction. In the last two chapters, we discussed
some tests of significance based on the standard error. Thus we
tested the sample mean $\bar{x}$ against the population mean $\mu$. We
also tested the significance of the difference between two sample
means. As a matter of fact, we were judging from samples the
relationship between fact and theory. These tests were all based
upon the assumption that the sampling distribution was
normal. In the present chapter, we shall study a distribution
which will enable us to compare not merely one value, but a
whole set of sample values with a corresponding set of hypothetical
ones. This distribution, called the $\chi^2$-distribution, was first
discovered by Helmert in 1875. Karl Pearson derived it independently
in 1900 and applied it as a test of 'goodness of fit'.

16.2. Definition of $\chi^2$. If $f_o$ denotes the observed frequency
and $f_e$ the corresponding expected frequency of a class interval or
cell, then we define $\chi^2$ by the relation

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} \qquad \ldots(1)$$

where the summation extends to the whole set of class intervals
or cells. Another useful form is given by

$$\chi^2 = \sum \frac{f_o^2}{f_e} - N \qquad \ldots(2)$$

where N is the total frequency.
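As a small numerical check of the two forms of the definition, the
following Python sketch computes the statistic both ways. The function
names and the six-cell frequencies are illustrative assumptions, not
part of the text; the second form relies on the observed and expected
totals being equal.

```python
def chi_square(observed, expected):
    """First form: sum of (f_o - f_e)^2 / f_e over all cells."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


def chi_square_alt(observed, expected):
    """Second form: sum of f_o^2 / f_e minus the total frequency N."""
    n_total = sum(observed)
    return sum(fo ** 2 / fe for fo, fe in zip(observed, expected)) - n_total


# Illustrative frequencies for six cells with equal expected frequencies.
f_o = [30, 25, 18, 10, 22, 15]
f_e = [20] * 6
print(chi_square(f_o, f_e), chi_square_alt(f_o, f_e))
```

Both calls should print the same value, since the two forms are
algebraically identical whenever the totals agree.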

16.3. Degrees of Freedom and Constraints. Suppose the
individuals of a sample are grouped into n classes or cells. The
theoretical frequencies are calculated on the supposition that the
data obey a certain law of distribution such as, for example, the
Normal or Binomial law. In forming this theoretical distribution,
we may have to use one or more constants calculated from the
sample. Let the number of such constants be c. We then define
the number of degrees of freedom $\nu$ by the relation

$$\nu = n - c. \qquad \ldots(1)$$

Hence to find the number of degrees of freedom, we must
subtract from the total number of cell frequencies the number
of constants taken from the observed data in finding the theoretical
distribution. Thus if the totals have been made equal in both
the theoretical and the observed distributions, we say one constraint
has been imposed on the data and consequently the number of
degrees of freedom is $n - 1$. Again if, in addition to the totals, the
means also have been made equal, there are two constraints and
$n - 2$ degrees of freedom, and so on.

Calculation of $\nu$ for a p × q contingency table.

In calculating theoretical frequencies in a contingency table,
we impose the limitations that the row totals, the column totals
and the grand total remain unaltered. Thus each of the p columns
and q rows imposes a constraint, giving p + q constraints. But
the sum of the border columns and the sum of the border rows
must each be equal to the grand total. Thus there are only
(p + q − 1) independent constraints. Hence the number of degrees
of freedom is given by

$$\nu = n - c = pq - (p + q - 1) = (p - 1)(q - 1).$$
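The counting rule can be written out in a few lines of Python. This is
only a sketch; the function names are hypothetical and chosen for
illustration.

```python
def degrees_of_freedom(n_cells, n_constraints):
    """nu = n - c: cells less constants taken from the sample."""
    return n_cells - n_constraints


def contingency_degrees_of_freedom(p, q):
    """nu for a p x q table: pq cells less (p + q - 1) constraints."""
    return degrees_of_freedom(p * q, p + q - 1)


print(contingency_degrees_of_freedom(2, 2), contingency_degrees_of_freedom(3, 4))
```

For any p and q the result agrees with the product (p − 1)(q − 1).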

16.4. Sampling distribution of $\chi^2$. It can be shown that
under certain conditions, the distribution of $\chi^2$ for large samples
is given by

$$y = y_0\, e^{-\chi^2/2}\, \chi^{\nu - 1} \qquad \ldots(1)$$

where

$$y_0 = \frac{1}{2^{(\nu-2)/2}\,\Gamma(\nu/2)}.$$

Since $\chi^2$ is a sum of squares, it cannot be negative. The
curve given by (1) will therefore extend from 0 to ∞. The
probability P of getting a value of $\chi^2$ as great as or greater than an
observed value $\chi_0^2$ is the area of the curve given by (1) to the
right of the ordinate at $\chi_0$ divided by the total area of the curve,
i.e. it is given by

$$P = \frac{\int_{\chi_0}^{\infty} e^{-\chi^2/2}\,\chi^{\nu-1}\,d\chi}{\int_{0}^{\infty} e^{-\chi^2/2}\,\chi^{\nu-1}\,d\chi}.$$
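The area ratio defining P can be approximated numerically. The sketch
below is an illustration only: it integrates the curve by Simpson's
rule, takes the constant $y_0$ in the form $1/\{2^{(\nu-2)/2}\,\Gamma(\nu/2)\}$
so that the total area is unity, and cuts the upper limit off at a
point beyond which the ordinate is negligible. The function names and
the cut-off are assumptions made for this example.

```python
import math


def chi_density(chi, dof):
    """Ordinate y = y0 * exp(-chi^2 / 2) * chi^(dof - 1) of the curve."""
    y0 = 1.0 / (2 ** ((dof - 2) / 2) * math.gamma(dof / 2))
    return y0 * math.exp(-chi * chi / 2) * chi ** (dof - 1)


def tail_probability(chi_square_value, dof, steps=4000):
    """Approximate P(chi^2 >= chi_square_value) by Simpson's rule (steps even)."""
    lower = math.sqrt(chi_square_value)
    upper = max(lower, math.sqrt(dof)) + 12.0  # ordinate is negligible beyond this
    h = (upper - lower) / steps
    total = chi_density(lower, dof) + chi_density(upper, dof)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * chi_density(lower + i * h, dof)
    return total * h / 3


print(round(tail_probability(3.841, 1), 3))
print(round(tail_probability(5.991, 2), 3))
```

The two values tried are the usual tabulated 5 per cent points for one
and two degrees of freedom, so both results should be close to 0.05.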

16.5. Conditions for the application of the $\chi^2$ test.

In applying the $\chi^2$-distribution as a test of significance, the
following precautions are necessary:

(a) The total frequency, N, should be fairly large. It should
be at least 50, however small the number of cells may be.

(b) No theoretical cell frequency should be small. It should
not be less than five. It is better if the smallest cell frequency is
10 or greater. When the cell frequencies are small in two or
more cells, they should be combined into a single cell.

(c) All the individuals in a sample must be independent.

(d) The number n of classes or cells should be neither too
small nor too large. It is better if 5 ≤ n ≤ 20. Sometimes
fewer than 5 cells can also be used provided the cell
frequencies are not small.

(e) The constraints imposed on the cell frequencies must
be linear.

16.6. Properties of the $\chi^2$-distribution.

The following properties of the curve given by (1) of § 16.4
are worth noting:

(i) If $\nu = 1$, the curve is given by

$$y = y_0\, e^{-\chi^2/2},$$

which is the standard normal curve for positive values of the
variate.

(ii) When $\nu > 1$, the curve is tangential to the x-axis at the
origin. Again, the curve attains its maximum value when

$$\frac{dy}{d\chi} = 0,$$

i.e. $$y_0\left[(\nu - 1)\,\chi^{\nu-2}\,e^{-\chi^2/2} + e^{-\chi^2/2}\,(-\chi)\,\chi^{\nu-1}\right] = 0$$

or $$y_0\, e^{-\chi^2/2}\,\chi^{\nu-2}\left[\chi^2 - (\nu - 1)\right] = 0$$

i.e. $$\chi^2 = \nu - 1.$$

When $\nu > 1$, the curve falls more slowly and ultimately
approaches zero as $\chi^2$ tends to infinity. Thus the curve is positively
skew towards higher values.

(iii) Fisher has shown that when $\nu$ is large, $\sqrt{2\chi^2}$ is
distributed approximately normally about a mean $\sqrt{2\nu - 1}$ with unit
standard deviation.

For values of $\nu > 30$, the approximation of the curve to
normality is quite good so that for these values of $\nu$, the tables
of the normal curve must be used to test the significance of the
difference between fact and theory.

(iv) Since the equation of the curve does not involve any
parameters of the population, the $\chi^2$ test is sometimes considered a
non-parametric test. Thus the $\chi^2$ distribution does not depend
upon the form of the parent population. It is due to this property
that the $\chi^2$ distribution is found useful in such a large number of
problems.
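Fisher's approximation in property (iii) lends itself to a short
computation. The Python sketch below forms the deviate
$\sqrt{2\chi^2} - \sqrt{2\nu - 1}$ and refers it to the normal curve. The
function names and the numerical case ($\chi^2 = 50$ with $\nu = 40$) are
hypothetical, chosen only for illustration.

```python
import math


def fisher_normal_deviate(chi_square_value, dof):
    """sqrt(2 * chi^2) - sqrt(2 * dof - 1), roughly standard normal for large dof."""
    return math.sqrt(2 * chi_square_value) - math.sqrt(2 * dof - 1)


def upper_tail_normal(z):
    """Area of the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical case: chi^2 = 50 observed with 40 degrees of freedom.
z = fisher_normal_deviate(50.0, 40)
print(round(z, 3), round(upper_tail_normal(z), 3))
```

A deviate well below 2 here would mean that the observed value is not
significant at the 5 per cent level.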

16.7. Levels of significance. Tables have been prepared for
the values of P, the probability of getting a value of $\chi^2 \ge \chi_0^2$,
where $\chi_0^2$ is an observed value. From these tables, we can find
the value of P corresponding to an observed value of $\chi^2$ and then
proceed to test whether the difference between observed and
theoretical frequencies is significant or not. The smaller the value
of P, the greater the divergence between fact and theory, so that
small values of P lead us to suspect the hypothesis. A value
of P very near to unity may also lead to a similar result. Thus
if P = 1, $\chi^2 = 0$, showing that there is perfect agreement between
fact and theory, which is a very improbable event. There are two
conventional levels of significance.

If P < 0.05, we say that the observed value of $\chi^2$ is significant
at the 5 per cent level of significance.

Similarly if P < 0.01, the value is significant at the 1% level.

16.8. Solved Examples.

1. In 120 throws of a single die, the following distribution of
faces was obtained.

Faces :   1    2    3    4    5    6   Total
$f_o$ :  30   25   18   10   22   15    120.

Do these results constitute a refutation of the "equal probability"
(null) hypothesis?

On the basis of equal probability, i.e. 1/6, the theoretical
frequency for each face will be 120/6 = 20.

Hence
$$\chi^2 = \frac{(30-20)^2}{20} + \frac{(25-20)^2}{20} + \frac{(18-20)^2}{20} + \frac{(10-20)^2}{20} + \frac{(22-20)^2}{20} + \frac{(15-20)^2}{20}$$

$$= \frac{1}{20}\,[100 + 25 + 4 + 100 + 4 + 25] = \frac{129}{10} = 12.90.$$

Now $\nu = 6 - 1 = 5$.

For 5 degrees of freedom, we get from the tables,

$\chi^2 = 11.070$ at 5 per cent level.

Since the calculated value of $\chi^2$ is greater than this value,
the hypothesis of equal probability is rejected. Consequently, the
die is biased.

2. Five dice were thrown 96 times and the numbers of times
4, 5 or 6 was thrown were:

No. of dice showing 4, 5 or 6 :   5    4    3    2    1    0
Frequency                     :   8   18   35   24   10    1.

Find the probability of getting this result by chance.

The probability of getting a 4, 5 or 6 in a single throw of a
single die is 1/2. Hence the theoretical frequencies are the
successive terms of the binomial expansion

$$96\left(\tfrac12 + \tfrac12\right)^5,$$

i.e. $$96\left[\left(\tfrac12\right)^5 + 5\left(\tfrac12\right)^4\left(\tfrac12\right) + 10\left(\tfrac12\right)^3\left(\tfrac12\right)^2 + 10\left(\tfrac12\right)^2\left(\tfrac12\right)^3 + 5\left(\tfrac12\right)\left(\tfrac12\right)^4 + \left(\tfrac12\right)^5\right]$$

or 3, 15, 30, 30, 15, 3.

Since the border frequencies are less than 5, we combine
them with the adjacent frequencies as follows:

$f_o$ :  26   35   24   11
$f_e$ :  18   30   30   18.

Hence
$$\chi^2 = \frac{(26-18)^2}{18} + \frac{(35-30)^2}{30} + \frac{(24-30)^2}{30} + \frac{(11-18)^2}{18}$$

$$= \frac{64}{18} + \frac{25}{30} + \frac{36}{30} + \frac{49}{18} = 8.31.$$

$\nu = 4 - 1 = 3$.

For three degrees of freedom, the value of P corresponding
to $\chi^2 = 8.31$ is found from the tables to be 0.041.

Hence the probability of getting as bad or worse a fit in
random sampling is 41/1000, i.e. about 1 in 25.
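The regrouping of border cells used in this example can be sketched in
Python. This is a minimal illustration, not a general rule from the
text: the helper name `pool_border_cells` is hypothetical, and it only
merges cells at the two ends while their expected frequency is below
the chosen minimum.

```python
def pool_border_cells(observed, expected, minimum=5):
    """Merge end cells into their neighbours while the expected frequency is small."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[0] < minimum:
        first_o, first_e = obs.pop(0), exp.pop(0)
        obs[0] += first_o
        exp[0] += first_e
    while len(exp) > 1 and exp[-1] < minimum:
        last_o, last_e = obs.pop(), exp.pop()
        obs[-1] += last_o
        exp[-1] += last_e
    return obs, exp


# Observed and theoretical frequencies of the five-dice example.
f_o, f_e = pool_border_cells([8, 18, 35, 24, 10, 1], [3, 15, 30, 30, 15, 3])
chi2 = sum((o - e) ** 2 / e for o, e in zip(f_o, f_e))
print(f_o, f_e, round(chi2, 2))
```

Pooling reduces the six cells to four, which is why the example works
with 4 − 1 = 3 degrees of freedom.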

3. In experiments on pea-breeding, Mendel obtained the
following frequencies of seeds:

Round and   Wrinkled and   Round and   Wrinkled and   Total
yellow      yellow         green       green
315         101            108         32             556

Theory predicts that the frequencies should be in proportions
9 : 3 : 3 : 1.

Examine the correspondence between theory and experiment.

(Agra M. Sc. 1952)

The corresponding theoretical frequencies are

$\frac{9}{16} \times 556$, $\frac{3}{16} \times 556$, $\frac{3}{16} \times 556$ and $\frac{1}{16} \times 556$,
i.e. 313, 104, 104 and 35 respectively.

Hence
$$\chi^2 = \frac{(315-313)^2}{313} + \frac{(101-104)^2}{104} + \frac{(108-104)^2}{104} + \frac{(32-35)^2}{35}$$

$$= \frac{4}{313} + \frac{9}{104} + \frac{16}{104} + \frac{9}{35} = 0.51$$

and $\nu = 4 - 1 = 3$.

The value of $\chi^2$ for 3 degrees of freedom at 5% level of
significance is 7.815. Since the calculated value is much less than this
value, there is a very high degree of agreement between theory
and experiment.

4. Find the value of chi-square for the following table:

Class                 :   A    B    C    D    E
Observed frequency    :   8   29   44   15    4
Theoretical frequency :   7   24   38   24    7

(Agra M. Sc. 1951, 55)

Since the border frequencies are less than 10, we regroup the
data as follows.

Class :  A + B    C    D + E
$f_o$ :    37    44     19
$f_e$ :    31    38     31

Hence
$$\chi^2 = \frac{(37-31)^2}{31} + \frac{(44-38)^2}{38} + \frac{(19-31)^2}{31}$$

$$= \frac{36}{31} + \frac{36}{38} + \frac{144}{31}$$

$$= 1.16 + 0.95 + 4.64 = 6.75 \text{ approx.}$$

5. Genetic theory states that children having one parent of
blood type M and the other of blood type N will always be one of
the three types M, MN, N and that the proportions of the three types
will on average be 1 : 2 : 1. A report states that out of 300 children
having one M parent and one N parent, 30% were found to be type
M, 45% type MN and the remainder type N. Test the hypothesis by
the $\chi^2$ test.

Observed and theoretical frequencies are as follows:

Type  :  M                     MN                     N
$f_o$ :  (30/100) × 300 = 90   (45/100) × 300 = 135   (25/100) × 300 = 75
$f_e$ :  (1/4) × 300 = 75      (2/4) × 300 = 150      (1/4) × 300 = 75.

Hence
$$\chi^2 = \frac{(90-75)^2}{75} + \frac{(135-150)^2}{150} + \frac{(75-75)^2}{75} = 3 + 1.5 + 0 = 4.5$$

and $\nu = 3 - 1 = 2$.

The 5% value of $\chi^2$ for two degrees of freedom is found to be
5.99, which is greater than the calculated value. Hence the
hypothesis holds true, i.e. the genetic theory appears to be
correct.

6. Four different makes of machine tool were tried at the
same time over a period of some months and the numbers of recorded
breakages for each type were 18, 12, 17, 15. Are there real
differences between them in respect of breakage?

Here the hypothesis to be tested is that there is no difference
between the four types. In all there are 62 recorded breakages,
so that on the average the number of breakages for each type of
tool would be 15.5, if the hypothesis were true.

Since there is only one constraint imposed by making the totals
of theoretical and observed frequencies agree, the number of
degrees of freedom will be 4 − 1 = 3.

Hence
$$\chi^2 = \frac{(18-15.5)^2}{15.5} + \frac{(12-15.5)^2}{15.5} + \frac{(17-15.5)^2}{15.5} + \frac{(15-15.5)^2}{15.5}$$

$$= 0.40 + 0.79 + 0.15 + 0.02 = 1.36.$$

The 5% value of $\chi^2$ for $\nu = 3$ is 7.81. Since the calculated
value of $\chi^2$ is much less than the table value, the hypothesis is not
rejected and the result gives no evidence of any real differences
between the machine tools in respect of breakage.

7. Twelve dice were thrown 4096 times and a throw of 6 was
reckoned as a success; the observed frequencies were as given
below:

Number of successes :    0     1     2    3    4    5    6   7 and over   Total
Frequencies         :  447  1145  1181  796  380  115   24       8        4096

Find the value of $\chi^2$ on the hypothesis that the dice were
unbiased and hence show that the data are consistent with the
hypothesis so far as the $\chi^2$ test is concerned. (Delhi B. A. Hons. '54)

The chance of success is 1/6 and that of failure is 5/6. Hence
the theoretical frequencies of 0, 1, 2, ..., 12 successes are the
successive terms of the binomial expansion of $4096\left(\tfrac56 + \tfrac16\right)^{12}$.

These are found to be as follows:

Number of successes :    0     1     2    3    4    5    6   7 and over   Total
$f_e$               :  459  1102  1212  808  364  116   27       8        4096

Hence
$$\chi^2 = \frac{(447-459)^2}{459} + \frac{(1145-1102)^2}{1102} + \frac{(1181-1212)^2}{1212} + \frac{(796-808)^2}{808} + \frac{(380-364)^2}{364} + \frac{(115-116)^2}{116} + \cdots$$

$$= 5.811.$$

Here $\nu = 8 - 1 = 7$.

The 5 per cent value of $\chi^2$ for $\nu = 7$ is found to be 14.07.
Since the calculated value is much less than this value, it follows
that the dice were unbiased so far as the $\chi^2$ test is concerned.

8. Records taken of the number of male and female births
in 800 families having four children are as follows:

Number of      Number of        Number of
male births    female births    families
    0              4               32
    1              3              178
    2              2              290
    3              1              236
    4              0               64
                        Total     800

Test whether the data are consistent with the hypothesis that the
binomial law holds and that the chance of a male birth is equal to
that of a female birth, namely that q = p = 1/2. You may use the
table given below:

Degrees of freedom     :    1      2      3      4      5
5% value of Chi-square :  3.84   5.99   7.82   9.49   11.07

(Agra B. Sc. '56)

The theoretical frequencies of 0, 1, 2, 3, 4 male births are
the successive terms of the binomial expansion

$$800\left(\tfrac12 + \tfrac12\right)^4$$

$$= 800\left[\left(\tfrac12\right)^4 + 4\left(\tfrac12\right)^3\left(\tfrac12\right) + 6\left(\tfrac12\right)^2\left(\tfrac12\right)^2 + 4\left(\tfrac12\right)\left(\tfrac12\right)^3 + \left(\tfrac12\right)^4\right]$$

$$= \frac{800}{16}\,[1 + 4 + 6 + 4 + 1]$$

$$= 50 + 200 + 300 + 200 + 50.$$

Thus the corresponding theoretical frequencies are 50, 200,
300, 200, 50.

Hence
$$\chi^2 = \frac{(32-50)^2}{50} + \frac{(178-200)^2}{200} + \frac{(290-300)^2}{300} + \frac{(236-200)^2}{200} + \frac{(64-50)^2}{50}$$

$$= \frac{1}{600}\,[18 \times 18 \times 12 + 22 \times 22 \times 3 + 100 \times 2 + 36 \times 36 \times 3 + 14 \times 14 \times 12]$$

$$= \frac{11780}{600} = \frac{589}{30} = 19.63.$$

Number of degrees of freedom = 5 − 1 = 4.

The 5 per cent value of $\chi^2$ for $\nu = 4$ is given to be 9.49.
Since the calculated value is much greater than this value, the
hypothesis of binomial law does not hold for the given data.
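The binomial expansion and the resulting statistic for this example can
be reproduced with a short Python sketch. The function names are
illustrative assumptions; the data are those of the example.

```python
from math import comb


def binomial_expected(total, n, p):
    """Expected frequencies total * C(n, r) * p^r * q^(n - r) for r = 0..n."""
    q = 1 - p
    return [total * comb(n, r) * p ** r * q ** (n - r) for r in range(n + 1)]


def chi_square(observed, expected):
    """Sum of (f_o - f_e)^2 / f_e over all cells."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


# 800 families with four children each, chance of a male birth p = 1/2.
f_o = [32, 178, 290, 236, 64]
f_e = binomial_expected(800, 4, 0.5)
print(f_e, round(chi_square(f_o, f_e), 2))
```

The same helper gives the theoretical frequencies for any of the dice
examples by changing the total, the number of trials and p.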

9. As a daily passenger in a bus I have examined the year of
issue of coins received by me as change. They have the following
distribution:

Year      :  1950  1951  1952  1953  1954
Frequency :    20    40   100   100    40

Does this indicate that the coins in circulation of different
years are nearly equal? Apply a proper test and interpret the
observed departures.

[Note. The following is the table of 5 per cent values of $\chi^2$:

d. f.    5 per cent value of $\chi^2$
  1          3.84
  2          5.99
  3          7.81
  4          9.48
  5         11.07]

(Nagpur B. Sc. Pass '56)

We take the hypothesis that the coins in circulation of
different years are equal in number, i.e. 60 for each year.

Hence
$$\chi^2 = \frac{(20-60)^2}{60} + \frac{(40-60)^2}{60} + \frac{(100-60)^2}{60} + \frac{(100-60)^2}{60} + \frac{(40-60)^2}{60}$$

$$= \frac{1}{60}\,[1600 + 400 + 1600 + 1600 + 400] = \frac{5600}{60} = 93.3$$

and degrees of freedom = 5 − 1 = 4.

The 5 per cent value of $\chi^2$ for $\nu = 4$ is given to be 9.48.
Since the calculated value is far greater than this value, the
hypothesis is wrong, i.e. the coins in circulation of different years
are not equal.


10. The normal rate of infection for a certain disease in
cattle is known to be 50 per cent. In an experiment with seven
animals injected with a new vaccine it was found that none of the
animals caught infection. Can the evidence be regarded as
conclusive (at the 1 per cent level of significance) to prove the
value of the new vaccine? (I. A. S. '42)

We have

         Infected   Not infected
$f_o$ :     0            7
$f_e$ :    7/2          7/2

Hence
$$\chi^2 = \frac{\left(0 - \tfrac72\right)^2}{\tfrac72} + \frac{\left(7 - \tfrac72\right)^2}{\tfrac72} = 7.$$

The 1 per cent value of $\chi^2$ for $\nu = 1$ is 6.635. Since the calculated
value is greater than this value, the difference between fact and
theory is significant, i.e. the new vaccine is effective.

11. In assigning grades to students, a teacher tries, in the long
run, to distribute them as follows:

A's: 10 per cent, B's: 20 per cent, C's: 40 per cent,
D's: 20 per cent, E's: 10 per cent.

In a certain class of 100 students, he found that he had
distributed the grades as follows:

A's: 8 per cent, B's: 22 per cent, C's: 46 per cent,
D's: 10 per cent, E's: 14 per cent.

Is this class a usual one?

We set up the following table for observed and expected
frequencies of different grades:

Grades :   A    B    C    D    E   Total
$f_o$  :   8   22   46   10   14    100
$f_e$  :  10   20   40   20   10    100

Hence
$$\chi^2 = \frac{(8-10)^2}{10} + \frac{(22-20)^2}{20} + \frac{(46-40)^2}{40} + \frac{(10-20)^2}{20} + \frac{(14-10)^2}{10}$$

$$= 0.4 + 0.2 + 0.9 + 5.0 + 1.6 = 8.1.$$

And $\nu = 5 - 1 = 4$.

The 5 per cent value of $\chi^2$ for $\nu = 4$ is found from the tables to be
9.488. Since the calculated value is less than this value, the
difference between fact and theory is not significant, i.e. the class
is a usual one.

Exercises

1. 200 digits were chosen at random from a set of tables. The
frequencies of the digits were:

Digit     :   0   1   2   3   4   5   6   7   8   9
Frequency :  18  19  23  21  16  25  22  20  21  15.

Use the $\chi^2$-test to assess the correctness of the hypothesis
that the digits were distributed in equal numbers in the
tables from which these were chosen. (Agra M. Sc. 1947)

[Ans. $\chi^2 = 4.3$. The 5% value of $\chi^2$ for $\nu = 9$ is 16.919.
Hence the hypothesis appears to be true.]

2. The following judgements were classified into six categories
taken to represent a continuum of opinions:

Categories :   I   II   III   IV    V   VI   Total
Judgements :  48   61    82   91   57   45    384.

Test the given distribution versus the "equal probability"
hypothesis.

[Ans. $\chi^2 = 27$. The 5% value of $\chi^2$ for $\nu = 5$ is 11.070. Hence
the hypothesis of equal probability is discarded.]

3. A die was thrown 300 times and gave the following
distribution for faces:

Faces :   1    2    3    4    5    6   Total
$f_o$ :  41   62   49   53   37   58    300.

Test the hypothesis that the die is unbiased.

[Ans. $\chi^2 = 6.76$. Also the tabulated value at 5% level for
$\nu = 5$ is 11.07. Hence the die is unbiased.]

4. In a certain experiment readings had to be made from a
scale which was graduated in tenths of an inch. A particular
observer took 1000 readings and an analysis was made of
the frequency of occurrence of the final digits of the
observations, those showing the tenth of an inch for which each
reading was taken. The frequency distribution of these
digits was as follows:

Digit     :    0    1    2    3    4    5    6    7    8    9   Total
Frequency :  151   79   95  109   50  185   67   98  110   56   1000.

There was no reason in the experiment for any particular
final digits to occur more frequently than any others.
Could this observer be regarded as reliable in reading the
scale?

[Hint. Here each $f_e$ is 100.]

[Ans. $\chi^2 = 160.02$. The 5% value of $\chi^2$ for $\nu = 9$ is 16.919.
Hence the observer is very unreliable. It is to be
noted that he has an undue tendency to record readings ending
in 0 or 5.]

5. The following table gives the number of aircraft accidents
that occurred during the various days of the week. Find
whether the accidents are uniformly distributed over the week.

Days             :  Sun.  Mon.  Tue.  Wed.  Thu.  Fri.  Sat.  Total
No. of accidents :   14    16     8    12    11     9    14     84

(I. C. A. R. Delhi 1951)

[Ans. Accidents appear to be uniformly distributed over
all days of the week.]

16.9. Test of goodness of fit. One of the principal uses of
the chi-square distribution is to test how well an observed
frequency distribution fits a theoretical distribution. When used
in this way, it is called a test of the "goodness of fit". The
expression written in inverted commas is
used in two ways. Firstly, it may mean the 'fit' of the observed
data to the hypothetical ones. Secondly, we may use it to test
the merits of a particular formula or a particular curve in
graduating a set of values without reference to a hypothesis.
For example, we may test how well a Normal curve or a Poisson
curve fits the given data. Calculations in both the cases are
exactly on the same lines. We have already solved a few
examples for testing the goodness of fit of the observed to the
hypothetical data. In these cases the population was completely
specified, there being no unknown parameters in its distribution.
The only limitation was that the totals of the observed and
hypothetical data must agree. Below we add a few more
examples for testing the goodness of fit where the expected
frequencies are not completely determined by the hypothesis but
depend upon some unknown parameters, as is the case when the
hypothesis states that the population is of the normal type
without specifying $\mu$ and $\sigma$. In such cases the number of degrees of
freedom is reduced. As before, we may calculate P to find how
good the fit is. We generally regard low values of P as
describing a poor fit. Very high values of P will give an excellent
fit provided we use the term in the second sense described above
(i.e. for assessing the closeness of the curve to the data). But
when we are testing a certain hypothesis, a very high value of
P will not necessarily establish the truth of the hypothesis. In
these cases the hypothesis is as definitely disproved as in the
case when the value of P is very low.

16.10. Solved Examples.

1. In a biochemical experiment 20 insects were put into each
of 100 jars and were subjected to a fumigant. After three hours the
number of living insects in each jar was counted. The distribution
is shown below. Fit a Binomial distribution to the data and test
the goodness of fit.

No. of insects alive, x :  0   1   2   3   4   5   6   7   8   9   Total
No. of jars, $f_o$      :  3   8  11  15  16  14  12  11   9   1    100.

The total number of insects which were put into jars was
20 × 100 = 2,000 and of these, the number of insects surviving was
$\sum x f_o = 439$. On the hypothesis that every insect had an equal
chance of surviving, the proportion of the survivors would be
439/2000 = 0.2195. Hence the Binomial distribution fitted to the data
is $100\,(0.7805 + 0.2195)^{20}$. The successive terms of this expansion
give the expected frequencies which are calculated and shown below:

x     :   0    1     2     3     4     5     6    7    8    9   10   11   Total
$f_o$ :   3    8    11    15    16    14    12   11    9    1    0    0    100
$f_e$ :  0.7  4.0  10.6  17.8  21.3  19.2  13.5  7.6  3.5  1.3  0.4  0.1   100

We regroup the frequencies so that no class contains less than
10. The classes x = 0, 1, 2 are combined into one class, and so are
the classes x = 7 to 11:

x     :  0 to 2     3      4      5      6    7 to 11   Total
$f_o$ :    22      15     16     14     12      21       100
$f_e$ :   15.3    17.8   21.3   19.2   13.5    12.9      100

Hence
$$\chi^2 = \frac{(22-15.3)^2}{15.3} + \frac{(15-17.8)^2}{17.8} + \frac{(16-21.3)^2}{21.3} + \frac{(14-19.2)^2}{19.2} + \frac{(12-13.5)^2}{13.5} + \frac{(21-12.9)^2}{12.9}$$

$$= 2.93 + 0.44 + 1.32 + 1.41 + 0.17 + 5.09 = 11.36.$$

To find the number of degrees of freedom we see that there
are two constraints imposed by two statistics calculated from
the sample, namely, the total 100, and the estimated value of p.
Hence number of d.f. = 6 − 2 = 4.

The 5 per cent value of $\chi^2$ for $\nu = 4$ is found to be 9.49.
Since the calculated value is greater than this value, the Binomial
distribution gives a poor fit to the observed data.
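The fitting procedure of this example can be followed step by step in
Python. This is a sketch under stated assumptions: the variable names
are illustrative, the expected frequencies are kept unrounded, and the
last class pools every x from 7 upwards, so the statistic comes out
close to, but not exactly equal to, the 11.36 obtained from the rounded
table.

```python
from math import comb

f_o = [3, 8, 11, 15, 16, 14, 12, 11, 9, 1]  # jars with x = 0..9 insects alive
n_insects, n_jars = 20, sum(f_o)

# Estimate p as the overall proportion of survivors.
survivors = sum(x * f for x, f in enumerate(f_o))
p = survivors / (n_insects * n_jars)
q = 1 - p

# Expected number of jars for x = 0..20 under the fitted binomial.
f_e = [n_jars * comb(n_insects, x) * p ** x * q ** (n_insects - x)
       for x in range(n_insects + 1)]

# Pool x = 0..2 into one class and x >= 7 into another.
obs = [sum(f_o[:3])] + f_o[3:7] + [sum(f_o[7:])]
exp = [sum(f_e[:3])] + f_e[3:7] + [sum(f_e[7:])]

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
print(p, [round(e, 1) for e in exp], round(chi2, 2))
```

Six classes remain after pooling and two constants (the total and p)
are taken from the sample, so the value is referred to 4 degrees of
freedom.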

2. Fit a normal distribution to the data given below and test
the goodness of fit.

Interval mid points : 100  95  90  85  80  75  70  65  60  55  50  45   Total
Frequency           :   0   1   3   2   7  12  10   9   5   3   2   0    54

The calculation of theoretical frequencies has been shown in
the example of § 8.23. These are found to be as follows:

x     :  100    95   90   85   80    75    70   65   60   55   50    45    Total
$f_e$ :  0.2   0.6  1.8  4.1  7.3  10.1  10.7  8.9  5.7  2.9  1.1   0.3    53.7
        (0.4)                                                      (0.4)

In order that the total observed and expected frequencies
may agree, we make the border expected frequencies 0.4 and
0.4 instead of 0.2 and 0.3, as shown in brackets above.

Again we regroup the data so that no class frequencies are
less than 10. This regrouping is shown below:

$f_o$ :   13     12     10     19   = 54
$f_e$ :  14.2   10.1   10.7    19   = 54.

Hence
$$\chi^2 = \frac{(13-14.2)^2}{14.2} + \frac{(12-10.1)^2}{10.1} + \frac{(10-10.7)^2}{10.7} + \frac{(19-19)^2}{19}$$

$$= \frac{1.44}{14.2} + \frac{3.61}{10.1} + \frac{0.49}{10.7}$$

$$= 0.101 + 0.357 + 0.046 = 0.504.$$

To calculate the number of degrees of freedom, we see that
one limitation is imposed by the fact that the total observed and
total estimated frequencies are made to agree. Also we estimated
both the mean and the standard deviation of the normal distribution
from the data of the sample. Thus we have three constraints
and consequently there is 4 − 3 = 1 degree of freedom.

The 5% value of $\chi^2$ for $\nu = 1$ is 3.841. Since the calculated
value is much less than this value, the result is not significant
and hence the normal distribution gives an excellent fit to the
observed data.

3. In 1,000 extensive sets of trials for an event of small
probability the frequencies $f_o$ of the number x of successes proved
to be:

x     :    0    1    2    3    4   5   6   7   Total
$f_o$ :  305  365  210   80   28   9   2   1   1,000.

Can this distribution be fitted by a Poisson distribution?

The theoretical frequencies for the Poisson distribution are
calculated in Ex. 15 of § 8.12. These are as follows:

x     :    0      1      2      3     4     5    6     7    Total
$f_e$ :  301.2  361.4  216.8  86.7  26.0  6.2  1.2   0.2    999.7
                                                    (0.5)

In order that the total frequencies may agree, we take the
last theoretical frequency as 0.5 instead of 0.2. We regroup
the data so that no frequencies are less than 10 as follows:

$f_o$ :   305    365    210    80     28    12   = 1000
$f_e$ :  301.2  361.4  216.8  86.7  26.0   7.9   = 1000.

$$\therefore\ \chi^2 = \frac{(305-301.2)^2}{301.2} + \frac{(365-361.4)^2}{361.4} + \frac{(210-216.8)^2}{216.8} + \frac{(80-86.7)^2}{86.7} + \frac{(28-26)^2}{26} + \frac{(12-7.9)^2}{7.9}$$

$$= 0.048 + 0.036 + 0.213 + 0.518 + 0.154 + 2.128 = 3.097.$$

Since the mean of the theoretical distribution has been
estimated from the sample data and the totals have been made to
agree, there are two constraints so that the number of degrees of
freedom is 6 − 2 = 4. The 5% value of $\chi^2$ for $\nu = 4$ is 9.488. Since
the calculated value is less than this value, the agreement between
fact and theory is quite good and hence a Poisson distribution
can be fitted to the data.
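The theoretical Poisson frequencies quoted in this example can be
recomputed as a check. The sketch below assumes the mean is taken as
m = 1.2, i.e. the sample mean rounded to one decimal place; that
rounding is an assumption of this illustration, since the original
calculation is in another section. The function name is hypothetical.

```python
from math import exp, factorial


def poisson_expected(total, mean, max_count):
    """Expected frequencies total * e^(-m) * m^r / r! for r = 0..max_count."""
    return [total * exp(-mean) * mean ** r / factorial(r)
            for r in range(max_count + 1)]


f_o = [305, 365, 210, 80, 28, 9, 2, 1]
sample_mean = sum(x * f for x, f in enumerate(f_o)) / sum(f_o)
m = round(sample_mean, 1)  # assumption: mean rounded to one decimal place

f_e = poisson_expected(1000, m, 7)
print(sample_mean, m, [round(f, 1) for f in f_e])
```

Under this assumption the computed frequencies agree with the quoted
ones to within about 0.1 in every class.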

Exercises

1. One hundred and ninety-two families (for each of which
the possibility of an albino child being born is otherwise
established) had the following distribution of albinos among
the first three children:

No. of albinos  :   0    1    2    3   Total
No. of families :  77   90   20    5    192

Find the expected frequencies on the basis of a theoretical
probability 0.25 of a child being born an albino and test
the goodness of fit. (I. A. S. 1958)

[Hint. p = 0.25, q = 1 − p = 0.75. Hence the expected
frequencies are given by

$$192\,(0.75 + 0.25)^3 = 192\left(\tfrac34 + \tfrac14\right)^3,$$

which on calculation come out to be 81, 81, 27 and 3
respectively. The last two frequencies should be grouped together
so that there are three classes in all. There is only one
constraint imposed by making the totals agree so that the
number of degrees of freedom is 3 − 1 = 2.]

[Ans. $\chi^2 = 2$ approx. The 5% value of $\chi^2$ for
$\nu = 2$ is 5.991. Hence the fit is good.]

2. A list of wars of modern civilisation provided the following
data for the years A. D. 1500 to 1931:

No. of outbreaks in the year :    0    1    2    3    4    5   Total
No. of such years            :  223  142   48   15    4    0    432

Fit a Poisson distribution to the data and test the goodness
of fit. (Madras B. A. 1954)

[Hint. $m = \dfrac{\sum f x}{\sum f} = \dfrac{299}{432} = 0.69$ approx. The theoretical
frequencies are given by

$$432\, e^{-0.69}\, \frac{(0.69)^r}{r!}$$

where r = 0, 1, 2, 3, 4, 5.

They come out to be as follows:

x     :     0      1      2     3     4    5    Total
$f_e$ :  216.7  149.5   51.5  11.8   2.0  0.3   431.8.

In order to make the totals agree, the frequencies 216.7,
149.5, 11.8 may be rounded up and the frequencies 51.5, 0.3
may be rounded down. Thus, we get

x     :    0    1    2    3    4    5
$f_e$ :  217  150   51   12    2    0

After regrouping, we get

$$\chi^2 = \frac{(223-217)^2}{217} + \frac{(142-150)^2}{150} + \frac{(48-51)^2}{51} + \frac{(19-14)^2}{14} = 2.57.$$

$\nu = 4 - 2 = 2$.

The 5% value of $\chi^2$ for $\nu = 2$ is 5.99. Hence the Poisson
distribution gives a satisfactory fit to the data.]

3. The following data show suicides of women in eight
German States during fourteen years:

No. of suicides in
a State per year   :  0   1   2   3   4   5   6   7   8   9  10   Total
Observed frequency :  9  19  17  20  15  11   8   2   3   5   3    112

Fit a Poisson distribution to the data and show that the fit
is not satisfactory.

4. Obtain the equation of the normal curve that may be fitted
to the following data. Examine the closeness of the fit of
the normal:

x :  4   6   8  10  12  14  16  18  20  22  24   Total
f :  1   7  15  22  35  43  38  20  13   5   1    200

[Ans. Normal curve fitted to the data is

$$y = \frac{200}{\sqrt{29.22\,\pi}}\; e^{-(x - 13.85)^2/29.22}.$$

$\chi^2 = 1.9$.

The 5 per cent value of $\chi^2$ for $\nu = 6$ is 12.59.
Hence the fit is quite good.]

16.11. Test for independence. Suppose we have a sample of
individuals, which can be classified in two or more different
ways. Very often we want to know whether these classifications are
independent. On the hypothesis of independence, we calculate
the theoretical frequencies and calculate the value of $\chi^2$. If the
$\chi^2$-test shows that the divergence between observation and
expectation is significant, we have to reject the hypothesis. For example,
persons may be classified as inoculated and not inoculated, and
also as attacked and not attacked by a disease. We may require
to know if the two classifications are independent. Again fathers
and sons may be classified according to one or more attributes
and the relationship of the attributes in the two groups studied.
Thus a very useful application of the $\chi^2$-test is to investigate the
relationship between traits or attributes which can be classified
into two or more categories.

Usually the sample data are set out in two-way tables called
contingency tables.

We have already found the number of degrees of freedom
for a contingency table (see § 16.3).

Coefficient of Contingency is given by

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}}.$$

If the null hypothesis is correct, $f_o$ and $f_e$ will be equal for
each cell (apart from sampling fluctuations), so that $\chi^2 = 0$ and
consequently, C = 0. It is clear that C can never be unity. The
maximum value of C depends on the number of cells. For
example, it can be shown that for a 2×2 table, the maximum value
of C is 0.707. Hence it is essential to mention the number of
cells when giving the value of C.
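The coefficient of contingency is simple to evaluate once the statistic
is known. In the Python sketch below the function name is an
illustrative assumption. The numerical case uses the fact that a 2×2
table with complete association (both off-diagonal cells empty) gives a
statistic equal to N, which is where the maximum of C arises.

```python
import math


def contingency_coefficient(chi_square_value, n_total):
    """C = sqrt(chi^2 / (N + chi^2))."""
    return math.sqrt(chi_square_value / (n_total + chi_square_value))


# Complete association in a 2 x 2 table of N = 100: chi^2 equals N.
print(round(contingency_coefficient(100.0, 100), 3))
# No divergence between fact and theory: chi^2 = 0.
print(contingency_coefficient(0.0, 100))
```

Because the statistic is always finite, the ratio under the root stays
below one, which is why C can never reach unity.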

16.12. Solved Examples.

1. Show that in a 2×2 contingency table wherein the
frequencies are

a   b
c   d,

$\chi^2$ calculated from independent frequencies is

$$\chi^2 = \frac{(a+b+c+d)\,(ad-bc)^2}{(a+b)(c+d)(b+d)(a+c)}.$$

(Agra B. Sc. '56, M. Sc. '48, '54, '59; Punjab M. A. '58;
Delhi M. A. '51, B. A. Hons. '47, '49, '52, '59, '61)

Classes     A        Not A     Totals
B           a         b        a + b
Not B       c         d        c + d
Totals     a + c     b + d     a + b + c + d = N

Clearly the chance for an individual to be an A is $\dfrac{a+c}{N}$ and
the chance for being a B is $\dfrac{a+b}{N}$. Since the classes are
independent, the chance for an individual to be both A and B is
$\dfrac{a+c}{N}\cdot\dfrac{a+b}{N}$. Hence the expected frequency for this sub-class in
a sample of N is

$$E(a) = N\cdot\frac{a+c}{N}\cdot\frac{a+b}{N} = \frac{(a+c)(a+b)}{N}.$$

Similarly,
$$E(b) = \frac{(a+b)(b+d)}{N},\quad E(c) = \frac{(a+c)(c+d)}{N}\quad\text{and}\quad E(d) = \frac{(c+d)(b+d)}{N}.$$

Hence
$$\chi^2 = \sum\left[\left\{a - \frac{(a+c)(a+b)}{a+b+c+d}\right\}^2 \Big/ \frac{(a+c)(a+b)}{a+b+c+d}\right]$$

$$= \sum \frac{\{a\,(a+b+c+d) - (a+c)(a+b)\}^2}{(a+c)(a+b)(a+b+c+d)}$$

$$= \sum \frac{(ad-bc)^2}{(a+c)(a+b)(a+b+c+d)}$$

$$= \frac{(ad-bc)^2}{(a+b+c+d)} \sum \frac{1}{(a+c)(a+b)}$$

$$= \frac{(ad-bc)^2}{(a+b+c+d)}\left[\frac{1}{(a+c)(a+b)} + \frac{1}{(a+b)(b+d)} + \frac{1}{(c+d)(a+c)} + \frac{1}{(c+d)(b+d)}\right]$$

$$= \frac{(ad-bc)^2}{(a+b+c+d)}\left[\frac{a+b+c+d}{(a+b)(a+c)(b+d)} + \frac{a+b+c+d}{(a+c)(c+d)(b+d)}\right]$$

$$= \frac{(ad-bc)^2}{(a+b+c+d)}\,(a+b+c+d)\left[\frac{c+d+a+b}{(a+b)(a+c)(b+d)(c+d)}\right]$$

$$= \frac{(ad-bc)^2\,(a+b+c+d)}{(a+b)(a+c)(c+d)(b+d)}.$$
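The identity just proved can be checked numerically. The Python sketch
below computes the statistic cell by cell from the expected
frequencies and compares it with the closed formula. The function names
and the four frequencies are hypothetical, chosen only for
illustration.

```python
def chi_square_2x2_direct(a, b, c, d):
    """Sum of (f_o - f_e)^2 / f_e over the four cells, f_e taken from the margins."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    cells = ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1))
    total = 0.0
    for f_o, i, j in cells:
        f_e = rows[i] * cols[j] / n
        total += (f_o - f_e) ** 2 / f_e
    return total


def chi_square_2x2_formula(a, b, c, d):
    """Closed form N (ad - bc)^2 / {(a+b)(c+d)(b+d)(a+c)}."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (b + d) * (a + c))


# Hypothetical frequencies chosen only for illustration.
print(chi_square_2x2_direct(12, 8, 5, 15), chi_square_2x2_formula(12, 8, 5, 15))
```

The closed form avoids computing the four expected frequencies
separately, which is its practical advantage for hand calculation.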

2. The individuals of two very large populations are divided into n sub-classes each and two random samples of sizes $N_1$ and $N_2$ are drawn from these populations respectively. The following are the observed frequencies for different classes :

Classes      1                     2                     ...    n                     Total
Sample 1     $\mu_{11}$            $\mu_{12}$            ...    $\mu_{1n}$            $N_1$
Sample 2     $\mu_{21}$            $\mu_{22}$            ...    $\mu_{2n}$            $N_2$
Totals       $\mu_{11}+\mu_{21}$   $\mu_{12}+\mu_{22}$   ...    $\mu_{1n}+\mu_{2n}$   $N_1+N_2$

On the hypothesis that the probability that an individual falls into the rth class (r = 1, 2, ..., n) is the same in the two populations, prove that

$$\chi^2 = N_1 N_2 \sum_{r=1}^{n}\left\{\frac{\left(\dfrac{\mu_{1r}}{N_1}-\dfrac{\mu_{2r}}{N_2}\right)^2}{\mu_{1r}+\mu_{2r}}\right\}.$$

(Agra M. Sc. 1953)











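On the given hypothesis, the probability of an individual falling into the rth class is $\frac{\mu_{1r}+\mu_{2r}}{N_1+N_2}$. Hence the expected frequency in the rth class of the 1st sample is $\frac{N_1(\mu_{1r}+\mu_{2r})}{N_1+N_2}$ and that for the same class of the second sample is $\frac{N_2(\mu_{1r}+\mu_{2r})}{N_1+N_2}$.

Hence

$$\chi^2 = \sum_{r=1}^{n}\left[\left\{\mu_{1r}-\frac{N_1(\mu_{1r}+\mu_{2r})}{N_1+N_2}\right\}^2 \Big/ \frac{N_1(\mu_{1r}+\mu_{2r})}{N_1+N_2} + \left\{\mu_{2r}-\frac{N_2(\mu_{1r}+\mu_{2r})}{N_1+N_2}\right\}^2 \Big/ \frac{N_2(\mu_{1r}+\mu_{2r})}{N_1+N_2}\right]$$

$$= \sum_{r=1}^{n}\left[\frac{(N_2\mu_{1r}-N_1\mu_{2r})^2}{N_1(N_1+N_2)(\mu_{1r}+\mu_{2r})}+\frac{(N_1\mu_{2r}-N_2\mu_{1r})^2}{N_2(N_1+N_2)(\mu_{1r}+\mu_{2r})}\right]$$

$$= \sum_{r=1}^{n}\frac{(N_2\mu_{1r}-N_1\mu_{2r})^2}{N_1N_2\,(\mu_{1r}+\mu_{2r})}$$

$$= N_1N_2\sum_{r=1}^{n}\left\{\frac{\left(\dfrac{\mu_{1r}}{N_1}-\dfrac{\mu_{2r}}{N_2}\right)^2}{\mu_{1r}+\mu_{2r}}\right\}.$$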


Note. The above formula for $\chi^2$ provides a test for homogeneity, i.e. it provides a test to decide whether the two samples could have come from the same population.
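The homogeneity formula can be compared with the ordinary cell-by-cell calculation of $\chi^2$. The sketch below is illustrative only: the function names and the class frequencies in the demonstration are made up, and every class is assumed to have a non-zero combined frequency.

```python
def chi2_homogeneity(sample1, sample2):
    """N1 N2 * sum over classes of (m1/N1 - m2/N2)^2 / (m1 + m2)."""
    n1, n2 = sum(sample1), sum(sample2)
    return n1 * n2 * sum(
        (m1 / n1 - m2 / n2) ** 2 / (m1 + m2)
        for m1, m2 in zip(sample1, sample2)
    )


def chi2_cell_by_cell(sample1, sample2):
    """Sum of (observed - expected)^2 / expected over all 2n cells."""
    n1, n2 = sum(sample1), sum(sample2)
    total = 0.0
    for m1, m2 in zip(sample1, sample2):
        for observed, size in ((m1, n1), (m2, n2)):
            # Expected frequency: sample size * pooled proportion of the class.
            expected = size * (m1 + m2) / (n1 + n2)
            total += (observed - expected) ** 2 / expected
    return total


if __name__ == "__main__":
    first = [20, 30, 50]    # hypothetical class frequencies, sample 1
    second = [30, 30, 40]   # hypothetical class frequencies, sample 2
    print(chi2_homogeneity(first, second))
    print(chi2_cell_by_cell(first, second))
```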

3. From the following table showing the number of plants having certain characters, test the hypothesis that the flower colour is independent of flatness of leaves.

                   Flat leaves    Curled leaves    Total
White Flowers           99              36          135
Red Flowers             20               5           25
Total                  119              41          160

You may use the following table giving the value of $\chi^2$ for one degree of freedom, for different values of P.

P           .99        .95        .50      .10      .05      .01
$\chi^2$    .000157    .00393     .455     2.706    3.841    6.635

(Agra M. Sc. 1957, I. A. S. 1946)

On the hypothesis that the flower colour is independent of the flatness of leaves, the theoretical frequency for the plants having white flowers and flat leaves $= \frac{135\times 119}{160} = 100$ approx.





Other theoretical frequencies follow from the fact that the 
border frequencies for the rows and columns remain unchanged. 

The expected frequencies are shown in the following 2 x 2 contingency table.

Flowers \ Leaves    Flat              Curled            Total
White               100               135-100 = 35      135
Red                 119-100 = 19      25-19 = 6          25
Total               119               41                160

Hence

$$\chi^2 = \frac{(99-100)^2}{100}+\frac{(36-35)^2}{35}+\frac{(20-19)^2}{19}+\frac{(5-6)^2}{6}$$

= 0.010 + 0.029 + 0.053 + 0.167
= 0.26 approx.

The number of d. f. = (2-1) (2-1) = 1.

The 5 per cent value of $\chi^2$ for $\nu$ = 1 is 3.841. The calculated value of $\chi^2$ is much less than this value. As a matter of fact, the calculated value is less than even the 50 per cent value of $\chi^2$. Hence the difference between observed and expected frequencies is insignificant. It follows that the flower colour is independent of the flatness of leaves.

4. The following table is published in a memoir written by Karl Pearson :

                                   Eye colour in sons
                                Not light     Light     Total
Eye colour      Not light          230         148        378
in fathers      Light              151         471        622
                Total              381         619       1000

Test whether the colour of the son's eyes is associated with that of the father's. (You may use the fact that 5 per cent value of chi-square for 1 degree of freedom is 3.84.) (I. A. S. 1942)

On the hypothesis that eye colours of sons and fathers are independent, the theoretical frequency corresponding to the observed 230 is $\frac{378\times 381}{1000} = 144$. The other theoretical frequencies are obtained from the border totals of rows and columns. These are 378-144 = 234 for the second cell of the first row, 381-144 = 237 for the first cell of the second row and 619-234 = 385 for the second cell of the second row.

Hence

$$\chi^2 = \frac{(230-144)^2}{144}+\frac{(148-234)^2}{234}+\frac{(151-237)^2}{237}+\frac{(471-385)^2}{385}$$

$$= 86\times 86\left[\frac{1}{144}+\frac{1}{234}+\frac{1}{237}+\frac{1}{385}\right] = 133 \text{ approx.}$$

Now the 5% value of $\chi^2$ for $\nu$ = 1 is 3.841. Since the calculated value of $\chi^2$ is much greater than this value, the hypothesis of independence is rejected. It follows that there is association between eye colours of fathers and sons.

5. In an experiment with immunization of cattle from tuberculosis, the following results were obtained :

                    Affected    Unaffected
Inoculated             12           26
Not inoculated         16            6

Examine the effect of vaccine in controlling susceptibility to tuberculosis. (I. A. S.)

We set up the hypothesis that vaccine has no effect in controlling susceptibility to tuberculosis. On this hypothesis the expected frequencies are as shown in the following table :

                    Affected                      Unaffected       Total
Inoculated          $\frac{38\times 28}{60}$ = 18     38-18 = 20        38
Not inoculated      28-18 = 10                    22-10 = 12        22
Total               28                            32                60

Hence

$$\chi^2 = \frac{(12-18)^2}{18}+\frac{(26-20)^2}{20}+\frac{(16-10)^2}{10}+\frac{(6-12)^2}{12}$$

$$= 36\left(\frac{1}{18}+\frac{1}{20}+\frac{1}{10}+\frac{1}{12}\right) = 36\times 0.289 = 10.4.$$

Now the tabulated value for 5% level of significance for one degree of freedom is found to be 3.841. Since the calculated value is greater than this value, the hypothesis is rejected and consequently the vaccine is effective in controlling susceptibility to tuberculosis.

6. The following table gives the numbers of boys and girls whose hair colour falls into five different groups. Find whether the hair colour and sex are associated.

            Fair     Red    Medium    Dark    Jet Black    Total
Boys         592     119      849      504        36       2,100
Girls        544      97      677      451        14       1,783
Totals     1,136     216    1,526      955        50       3,883


Here we have an example of 2 x n contingency table. The theoretical frequencies are calculated on the hypothesis that hair colour is independent of sex. These are as follows :

$\frac{1136\times 2100}{3883}$ = 614.4 for boys with fair hair;
$\frac{216\times 2100}{3883}$ = 116.8 for boys with red hair;
$\frac{1526\times 2100}{3883}$ = 825.3 for boys with medium hair;
$\frac{955\times 2100}{3883}$ = 516.5 for boys with dark hair.

The remaining expected frequencies are calculated with the help of border totals. Thus we get

2100 - 614.4 - 116.8 - 825.3 - 516.5 = 27 for boys with jet black hair;
1136 - 614.4 = 521.6 for girls with fair hair;
216 - 116.8 = 99.2 for girls with red hair;
1526 - 825.3 = 700.7 for girls with medium hair;
955 - 516.5 = 438.5 for girls with dark hair;
and 50 - 27 = 23 for girls with jet black hair.

Hence

$$\chi^2 = \frac{(592-614.4)^2}{614.4}+\frac{(544-521.6)^2}{521.6}+\frac{(119-116.8)^2}{116.8}+\frac{(97-99.2)^2}{99.2}+\frac{(849-825.3)^2}{825.3}$$

$$+\frac{(677-700.7)^2}{700.7}+\frac{(504-516.5)^2}{516.5}+\frac{(451-438.5)^2}{438.5}+\frac{(36-27)^2}{27}+\frac{(14-23)^2}{23}$$

= 0.82 + 0.96 + 0.04 + 0.05 + 0.68 + 0.80 + 0.30 + 0.36 + 3.00 + 3.52
= 10.53

and number of d. f. = (2-1) (5-1) = 4.


The 5% tabulated value of $\chi^2$ for $\nu$ = 4 is 9.49 and the 1% value is 13.28. Hence the result is significant at 5% level but not at 1% level. It follows that there is some evidence of association between sex and hair colour so far as the 5% level of significance is concerned.
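The same routine applies to any contingency table. As a rough sketch (the function names are illustrative, not from the text), the following Python code builds the expected frequencies from the border totals and then sums the contributions of all cells; it is applied here to the hair-colour table. Because it keeps the expected frequencies unrounded, it gives a value close to, though not identical with, the hand calculation above.

```python
def expected_frequencies(table):
    """Expected cell frequencies under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Return (chi-square, degrees of freedom) for an r x c contingency table."""
    expected = expected_frequencies(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


if __name__ == "__main__":
    hair = [
        [592, 119, 849, 504, 36],   # boys: fair, red, medium, dark, jet black
        [544, 97, 677, 451, 14],    # girls
    ]
    chi2, df = chi_square(hair)
    print(round(chi2, 2), df)
```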





Exercises 

1. The table given below shows the data obtained during an epidemic of cholera :

                    Attacked    Not attacked    Total
Inoculated              31           469          500
Not inoculated         185          1315         1500
Total                  216          1784         2000

Test the effectiveness of inoculation in preventing the attack of cholera.

[5% value of $\chi^2$ for one degree of freedom is 3.84.]

(Indian Audit and Accounts Service 1941)

[Ans. $\chi^2$ = 14.64 approx. $\therefore$ The inoculation is effective in preventing the attack of cholera.]

2. The following table shows the results of inoculation against cholera :

                    Not attacked    Attacked
Inoculated               431            5
Not Inoculated           291            9

Is there any significant association between inoculation and attack ? Given that

P = .074 for $\chi^2$ = 3.2
P = .069 for $\chi^2$ = 3.3. (Agra M. Sc. 1958)
3. The following data are observed for hybrids of Dhatura :

Flowers violet, fruits prickly ... 47
Flowers violet, fruits smooth  ... 12
Flowers white, fruits prickly  ... 21
Flowers white, fruits smooth   ... 3.

Using Chi-square test, find the association between colour of flowers and character of fruit; given that

P = .402 for $\chi^2$ = .7
P = .399 for $\chi^2$ = .71 (Agra 1956)

[Ans. $\chi^2$ = .66 approx. There appears to be no association between colour of flowers and character of fruit.]

4. The following table gives the observations on the condition of 246 children as regards their cleanliness and tidiness and the condition of the homes from which they came.

Condition                 Condition of Home
of child                Clean      Not clean      Total
Clean                     76           43          119
Fairly clean              38           17           55
Dirty and ill-kept        25           47           72
Total                    139          107          246

Examine whether there is any association between the condition of the children and the condition of their homes.

[Ans. $\chi^2$ = 20.0. There is association between the condition of home and the condition of children.]

5. The following table shows the association among 1000 school-boys between their general ability and their mathematical ability. Calculate the coefficient of contingency between the two.

Maths.                 General ability
ability            Good      Fair      Poor
Good                 44        22         4
Fair                265       257       178
Poor                 41        91        98

(Punjab M. A. 1945)

[Ans. $C = \sqrt{\dfrac{\chi^2}{N+\chi^2}}$ = 0.25 approx., $\chi^2$ = 70 approx.]






6. Show that for the entries in the following 2 x n contingency table :


A 2 A < Total 



the value of \ 2 is 


X 2 = £ ojf { Px —pf y 


where 




(Delhi M. A. 1958) 
[Hint. Proceed as in Ex. 2 of § 16.12.]

16.13. Yates' correction for continuity. The distribution of $\chi^2$ is essentially continuous but the distribution of frequencies is always discontinuous. This discontinuous distribution will tend to the continuous distribution as the size of the sample increases. It was the reason for supposing that no expected frequency should be small. When some expected frequency was small we regrouped the classes. However, this procedure is not possible in a 2 x 2 contingency table. Making allowance for this, Yates made a correction in the case of 2 x 2 contingency table. This correction may be described thus :

Decrease by $\frac{1}{2}$ those cell frequencies which are greater than expected frequencies and increase by $\frac{1}{2}$ those which are less than expectation.

It will not affect the marginal columns. This correction is known as Yates' correction for continuity.

Formula of Ex. 1 of § 16.12 will then be given by

$$\chi^2 = \frac{\left(ad-bc-\frac{1}{2}N\right)^2 N}{(a+c)\,(b+d)\,(c+d)\,(a+b)} \qquad ...(1)$$

when ad - bc > 0

and

$$\chi^2 = \frac{\left(bc-ad-\frac{1}{2}N\right)^2 N}{(a+c)\,(b+d)\,(c+d)\,(a+b)} \qquad ...(2)$$

when ad - bc < 0.

Thus the result of applying Yates' correction is to decrease slightly the value of $\chi^2$.
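Formulae (1) and (2) differ only in the sign of ad - bc, so both can be covered by taking the absolute value. The Python sketch below is an illustration under that reading; the function names are not from the text, and exact fractions are used only to avoid rounding. Setting the corrected difference to zero when |ad - bc| is smaller than N/2 is a common convention assumed here, not something stated in the text.

```python
from fractions import Fraction


def chi2_plain(a, b, c, d):
    """Uncorrected chi-square for a 2 x 2 table: N (ad - bc)^2 / product of border totals."""
    n = a + b + c + d
    return Fraction(n * (a * d - b * c) ** 2,
                    (a + c) * (b + d) * (c + d) * (a + b))


def chi2_yates(a, b, c, d):
    """Chi-square with Yates' correction: (|ad - bc| - N/2)^2 N / product of border totals."""
    n = a + b + c + d
    # Assumption: never let the corrected difference go below zero.
    corrected = max(abs(a * d - b * c) - Fraction(n, 2), 0)
    return n * corrected ** 2 / Fraction((a + c) * (b + d) * (c + d) * (a + b))


if __name__ == "__main__":
    # Hypothetical fourfold table with one small frequency.
    print(float(chi2_plain(17, 18, 3, 12)))
    print(float(chi2_yates(17, 18, 3, 12)))
```

The corrected value is never larger than the uncorrected one, which is the sense in which the correction decreases $\chi^2$.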

Example. The following information was obtained in a sample of 50 small general shops :

                          Shops in
                  Urban Districts    Rural Districts    Total
Owned by men            17                 18             35
Owned by women           3                 12             15
Total                   20                 30             50

Can it be said that there are more women owners of small general shops in rural than in urban districts ?

We set up the hypothesis that there are an equal number of women owners of small shops both in rural and urban areas. Since one of the frequencies is small, we apply Yates' correction and the formula (1) of § 16.13. Thus we get

$$\chi^2 = \frac{\left(17\times 12-3\times 18-\frac{1}{2}\times 50\right)^2\times 50}{20\times 30\times 35\times 15} = \frac{125\times 125\times 50}{20\times 30\times 35\times 15} = \frac{625}{252} = 2.48 \text{ approx.}$$

The 5% value of $\chi^2$ for one degree of freedom is 3.841, which is greater than the calculated value. Hence the hypothesis appears to be correct. It follows that there are no more women owners of small shops in rural than in urban areas.

Also the value of $\chi^2$ without Yates' correction is

$$\chi^2 = \frac{(17\times 12-3\times 18)^2\times 50}{20\times 30\times 35\times 15} = \frac{150\times 150\times 50}{20\times 30\times 35\times 15} = \frac{75}{21} = 3.57.$$





This value of $\chi^2$ is slightly less than the 5% value. Thus this result also supports the hypothesis though not so strongly as the previous one.

16.14. Additive Property of $\chi^2$. We state without proof the following theorem :

If the independent variates $\chi_1^2, \chi_2^2, ..., \chi_n^2$ all conform to $\chi^2$ distribution with $\nu_1, \nu_2, ..., \nu_n$ degrees of freedom respectively, the sum of the variables $\chi_1^2+\chi_2^2+...+\chi_n^2$ is distributed like $\chi^2$ with $\nu_1+\nu_2+...+\nu_n$ degrees of freedom.

This additive property of $\chi^2$ is important in many experimental studies. It sometimes happens that when an experiment is repeated a number of times (the experiments being independent), the value of P calculated for each may not be entirely conclusive. In such cases the values of $\chi^2$ calculated for different experiments may be added, to give a new $\chi^2$ with d. f. = the sum of the separate d. f.'s, and then a test of significance can be obtained.

Example 1. Five $\chi^2$'s computed from fourfold tables in independent replications of an experiment are 0.50, 4.10, 1.20, 2.79 and 5.41. Does the aggregate of these tests yield a significant $\chi^2$ ?

Sum of the values of $\chi^2$'s = 0.50 + 4.10 + 1.20 + 2.79 + 5.41 = 14.00.

Since for a fourfold table there is only one degree of freedom, the sum of the degrees of freedom is 5.

Now the 5 per cent value of $\chi^2$ for d. f. = 5 is 11.070. Hence the sum of the values of $\chi^2$ is significant.

It is to be observed that two of the individual values of $\chi^2$ (i.e. 4.10, 5.41) are significant and three of the values are not significant at 5 per cent level. Students may think that the aggregate result is not significant. But it is not so, as has been shown above.

Example 2. A certain hypothesis is tested by three similar experiments. These gave $\chi^2$ = 11.9 for $\nu$ = 6, $\chi^2$ = 14.2 for $\nu$ = 8 and $\chi^2$ = 18.3 for $\nu$ = 11. Show that the three experiments together provide more justification for rejecting the hypothesis than any one experiment alone.

The 5% value of $\chi^2$ for $\nu$ = 6 is 12.592. Thus the value of $\chi^2$ in the first experiment is not significant. And 5% value of $\chi^2$ for $\nu$ = 8 is 15.507 so that the result of the second experiment also is not significant. Again 5% value of $\chi^2$ for $\nu$ = 11 is 19.675, showing that the third experiment also gives a non-significant result. Thus none of these experiments enables us to reject the hypothesis.

Now the sum of the values of $\chi^2$ for the three experiments

= 11.9 + 14.2 + 18.3
= 44.4

and the number of degrees of freedom

= 6 + 8 + 11 = 25.

The 5% value of $\chi^2$ for $\nu$ = 25 is 37.652. Since the added value of $\chi^2$ is greater than this value, the hypothesis is rejected.
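The pooling procedure of the two examples amounts to adding the values of $\chi^2$ and adding the degrees of freedom. A minimal Python sketch follows; the function names are illustrative, and the small table of 5 per cent points holds only the standard tabulated values needed for the degrees of freedom occurring in these examples.

```python
# Standard 5 per cent points of chi-square for the degrees of freedom used here.
CRITICAL_5_PERCENT = {1: 3.841, 5: 11.070, 6: 12.592, 8: 15.507, 11: 19.675, 25: 37.652}


def pool(results):
    """Add independent chi-squares and their degrees of freedom.

    `results` is a list of (chi_square, degrees_of_freedom) pairs.
    """
    total_chi2 = sum(chi2 for chi2, _ in results)
    total_df = sum(df for _, df in results)
    return total_chi2, total_df


def significant_at_5_percent(chi2, df):
    """True when chi2 exceeds the tabulated 5 per cent point for df."""
    return chi2 > CRITICAL_5_PERCENT[df]


if __name__ == "__main__":
    experiments = [(11.9, 6), (14.2, 8), (18.3, 11)]
    print([significant_at_5_percent(c, v) for c, v in experiments])
    chi2, df = pool(experiments)
    print(round(chi2, 1), df, significant_at_5_percent(chi2, df))
```

The addition is legitimate only when the experiments are independent, as the theorem requires.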



CHAPTER XVII

THE SAMPLING OF VARIABLES : SMALL SAMPLES

17.1. The theory of small samples differs markedly from that of large samples. In the case of large samples, for example, we took as an estimate of a parameter in a population the value calculated from the sample and used this value in gauging the precision of our estimates. Again it will no longer be possible in small samples to assume that the random sampling distribution of a statistic is approximately normal. A new technique is therefore necessary to deal with the theory of small samples. We shall therefore discuss briefly the basis on which estimates of given parameters are to be made. We shall confine our work to finding the estimates of the mean and the standard deviation since these are the two main parameters of interest.


17.2. Estimates of the arithmetic mean. As in the case of large samples, we take as our estimate of the population mean the value of the sample mean. Thus if $x_1, x_2, ..., x_n$ are the values in a sample, the estimate of the mean in the population is given by $\bar{x}$ where

$$\bar{x} = \frac{1}{n}\sum_{r=1}^{n} x_r.$$


17.3. Estimates of the variance. (Agra 1948)

Let $\mu$ denote the mean and $\sigma$ the standard deviation of the population.

If $\mu$ is known, the estimate of the variance of the population is given by $s^2$ where

$$s^2 = \frac{1}{n}\sum_{r=1}^{n}(x_r-\mu)^2.$$

But generally we do not know the value of $\mu$, which we shall have to estimate.

If $\bar{x}$ is the sample mean, we have

$$\Sigma (x_r-\mu)^2 = \Sigma\{(x_r-\bar{x})+(\bar{x}-\mu)\}^2$$

$$= \Sigma (x_r-\bar{x})^2+\Sigma(\bar{x}-\mu)^2 \qquad [\text{since } \Sigma (x_r-\bar{x}) = 0]$$

$$= \Sigma (x_r-\bar{x})^2+n\,(\bar{x}-\mu)^2.$$

Hence

$$s^2 = \frac{1}{n}\,\Sigma (x_r-\bar{x})^2+(\bar{x}-\mu)^2.$$

Thus we see that $s^2$ differs from the variance of sample $\frac{1}{n}\Sigma (x_r-\bar{x})^2$ by a quantity $(\bar{x}-\mu)^2$ which is essentially positive and will not, in general, vanish. Hence by taking the variance of the sample to be an estimate of the variance of the population, we shall commit a systematic error of magnitude $(\bar{x}-\mu)^2$.

Now this term $(\bar{x}-\mu)^2$ is the square of the deviation of the mean of the sample from the mean of the population and its mean value in a large number of samples is the variance of the mean which we know to be equal to $\frac{\sigma^2}{n}$ where $\sigma$ is the standard deviation of the population. We then have

$$s^2 = \frac{1}{n}\,\Sigma (x_r-\bar{x})^2+\frac{\sigma^2}{n}.$$

If the value of $\sigma$ is unknown, we may take it approximately equal to s. Thus we get

$$s^2 = \frac{1}{n}\,\Sigma (x_r-\bar{x})^2+\frac{s^2}{n}$$

or

$$(n-1)\,s^2 = \Sigma (x_r-\bar{x})^2 = n\,\sigma_s^2,$$

i.e.

$$s^2 = \frac{1}{(n-1)}\,\Sigma (x_r-\bar{x})^2 \qquad ...(2)$$

where $\sigma_s$ is the standard deviation of the sample values.
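The two divisors can be compared directly on a small set of values. The following Python sketch is illustrative only (the function name and the data are assumptions made for the demonstration); it returns both the sample variance with divisor n and the estimate of the population variance with divisor n - 1.

```python
def variance_estimates(values):
    """Return (sample variance with divisor n, population estimate with divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    sample_variance = sum_sq / n            # sigma_s squared
    population_estimate = sum_sq / (n - 1)  # s squared, as in formula (2)
    return sample_variance, population_estimate


if __name__ == "__main__":
    data = [-4, -2, -2, 0, 2, 2, 3, 3]   # illustrative values
    sigma_s_sq, s_sq = variance_estimates(data)
    print(sigma_s_sq, s_sq)
    # The two are connected by (n - 1) s^2 = n sigma_s^2.
    print(abs((len(data) - 1) * s_sq - len(data) * sigma_s_sq) < 1e-12)
```

Python's standard `statistics` module makes the same distinction: `statistics.pvariance` divides by n and `statistics.variance` divides by n - 1.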


17.4. *t-Distribution. Using the notation of the previous article, we define a new statistic t by the equation

$$t = \frac{\bar{x}-\mu}{s}\,\sqrt{n} = \frac{\bar{x}-\mu}{\sigma_s}\,\sqrt{(n-1)}$$

or

$$t = \frac{\bar{x}-\mu}{s}\,\sqrt{(\nu+1)},$$

where $\nu$ (= n - 1) denotes the number of degrees of freedom of t.

*It was W. S. Gosset who under the nom de plume (pen name) of 'Student' first found the distribution of $t = \frac{\bar{x}-\mu}{s}$. R. A. Fisher later on defined t correctly by the equation $t = \frac{\bar{x}-\mu}{s}\sqrt{n}$ and found its distribution in 1926. Students should not confuse Fisher's t-distribution with his z-distribution defined later on.





Then it may be shown that, for samples of size n from a normal population, the distribution of t is given by

$$y = \frac{y_0}{\left(1+\dfrac{t^2}{\nu}\right)^{\frac{\nu+1}{2}}}.$$

If we choose $y_0$ so that the total area under the curve is unity, we shall get

$$y_0 = \frac{1}{\sqrt{\nu}\;B\left(\frac{\nu}{2}, \frac{1}{2}\right)} = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\left(\frac{\nu}{2}\right)}.$$

We can easily study the form of the t-distribution. Since only even powers of t appear in its equation it is symmetrical about t = 0 like the normal distribution, but unlike the normal distribution, it has $\gamma_2 > 0$ so that it is more peaked than the normal distribution with the same standard deviation. Also y attains its maximum value at t = 0 so that the mode coincides with the mean at t = 0. Again the limiting form of the distribution when $\nu \to \infty$ is given by

$$y = y_0\,e^{-\frac{1}{2}t^2}.$$

It follows that t is normally distributed for large samples.

17.5. Probability Tables of t-Distribution. The chance that on random sampling, we shall get a value of t not greater than some value $t_0$ is the area of the curve

$$y = \frac{y_0}{\left(1+\dfrac{t^2}{\nu}\right)^{\frac{\nu+1}{2}}}$$

to the left of the ordinate at the point $t_0$ and is given by

$$\int_{-\infty}^{t_0} y\,dt.$$

Similarly the probability that we get a value of t between the limits $t_1$ and $t_2$ is given by $\int_{t_1}^{t_2} y\,dt$.

Student himself prepared tables of the integral

$$P_S = \int_{-\infty}^{t_0} y\,dt$$

for various degrees of freedom and different values of t which are generally required in practical problems.

The Fisher-Yates tables give the area given by

$$P_F = 1-\int_{-t_0}^{t_0} y\,dt.$$





Relation between $P_S$ and $P_F$. From the above definition of $P_S$ and $P_F$ it is clear that $P_S$ denotes the probability that an observed value will not exceed $t_0$ whereas $P_F$ denotes the probability that an observed value of t, regardless of sign, will exceed $t_0$.

Now

$$P_S = \int_{-\infty}^{t_0} y\,dt = \int_{-\infty}^{0} y\,dt+\int_{0}^{t_0} y\,dt = \frac{1}{2}+\int_{0}^{t_0} y\,dt \qquad ...(1)$$

and

$$P_F = 1-\int_{-t_0}^{t_0} y\,dt = 1-2\int_{0}^{t_0} y\,dt. \qquad ...(2)$$

Multiplying (1) by 2 and adding the result to (2), we get

$$2P_S+P_F = 2$$

or $$P_F = 2\,(1-P_S).$$
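The two tabulated probabilities can be reproduced approximately by numerical integration of the t curve. The Python sketch below is an illustration, not the method by which the tables were constructed: the function names are hypothetical, the constant $y_0$ is taken as the standard normalising value $\Gamma(\frac{\nu+1}{2}) / \{\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})\}$, and the area from 0 to $t_0$ is found by Simpson's rule.

```python
import math


def t_density(t, nu):
    """Ordinate y of the t curve with nu degrees of freedom (total area unity)."""
    y0 = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return y0 * (1 + t * t / nu) ** (-(nu + 1) / 2)


def area_zero_to(t0, nu, steps=2000):
    """Simpson's rule for the area under the curve between 0 and t0 (t0 >= 0, steps even)."""
    h = t0 / steps
    total = t_density(0.0, nu) + t_density(t0, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, nu)
    return total * h / 3


def p_student(t0, nu):
    """P_S: chance of a value of t not exceeding t0."""
    return 0.5 + area_zero_to(t0, nu)


def p_fisher(t0, nu):
    """P_F: chance of a value of t exceeding t0 in absolute value."""
    return 1 - 2 * area_zero_to(t0, nu)


if __name__ == "__main__":
    print(round(p_student(1.8, 8), 3))
    print(round(p_fisher(2.262, 9), 3))
```

By construction the two functions satisfy $P_F = 2(1-P_S)$ for every $t_0 \ge 0$.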

17.6. Uses of t-distribution. We have seen that if the sample is large, use is made of the tables of the normal probability integral in interpreting the results of an experiment and, on the basis of that, in rejecting or accepting the null hypothesis. If, however, the sample size n is small, the normal probability tables will no longer be useful. It was the discovery of the t-distribution by Student in 1908 that provided the answer to this difficulty. A rigorous proof of Student's t-distribution was provided some years later by R. A. Fisher. New tables giving the significance levels of observed values of t were prepared. There are various uses of t-distribution. A few of them are mentioned below :

(a) To test the significance of the mean of a small random sample from a normal population.

(b) To test the significance of the difference between the means of two samples taken from a normal population.

(c) To test the significance of an observed coefficient of correlation including partial and rank correlations.

(d) To test the significance of an observed regression coefficient.

We now discuss some of these tests in some detail to emphasize the importance of t-distribution.

17.7. Test of significance of the mean of a random sample from a normal population.

Suppose we have to test the hypothesis that the mean of a normal population from which a small sample has been drawn is $\mu$. For this we first calculate $t = \frac{\bar{x}-\mu}{s}\sqrt{n}$ and then from the tables, find the value of $P_F$ for the given degrees of freedom. If $P_F$ < .05, we say that the difference between the population mean and the sample mean is significant at 5 per cent level of significance and the hypothesis is to be rejected. If $P_F$ < .01, the difference is said to be significant at 1 per cent level of significance. In this case, the difference is said to be highly significant. On the other hand, if $P_F$ > .05, we say that the data are consistent with the hypothesis that $\mu$ is the mean of the population.

Note. We shall specifically assume that the parent population is normal unless otherwise stated.
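The calculation of t from raw sample values can be sketched in Python as follows. The function name is illustrative; the statistic is computed in the form $t = \frac{\bar{x}-\mu}{\sigma_s}\sqrt{n-1}$, where $\sigma_s$ is the standard deviation of the sample with divisor n, which is the same as $\frac{\bar{x}-\mu}{s}\sqrt{n}$.

```python
import math


def student_t(values, mu):
    """Student's t for the hypothesis that the population mean is mu."""
    n = len(values)
    mean = sum(values) / n
    # Standard deviation of the sample values, with divisor n.
    sigma_s = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return (mean - mu) / sigma_s * math.sqrt(n - 1)


if __name__ == "__main__":
    heights = [63, 63, 66, 67, 68, 69, 70, 70, 71, 71]   # inches
    t = student_t(heights, 66)
    print(round(t, 2), "with", len(heights) - 1, "degrees of freedom")
```

The resulting value is then compared with the tabulated value of t for n - 1 degrees of freedom at the chosen level of significance.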


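17.8. Solved Examples.

1. Find the Student's t for the following variate values in a sample of eight : -4, -2, -2, 0, 2, 2, 3, 3, taking the mean of the universe to be zero. (Agra M. Sc. 1948)

$\bar{x}$ = mean of the sample = $\frac{2}{8}$ = .25.

$\sigma_s$ = s. d. of the sample, and

$$\sigma_s^2 = \tfrac{1}{8}\left[(-4-.25)^2+2\,(-2-.25)^2+(0-.25)^2+2\,(2-.25)^2+2\,(3-.25)^2\right]$$

$$= \tfrac{1}{8}\left[(4.25)^2+2\,\{(2.25)^2+(1.75)^2\}+(.25)^2+2\,(2.75)^2\right]$$

$$= \tfrac{1}{8}\left[\tfrac{289}{16}+\tfrac{65}{4}+\tfrac{1}{16}+\tfrac{121}{8}\right] = \frac{792}{8\times 16} = \frac{99}{16}.$$

$$\therefore\ \sigma_s = \frac{\sqrt{(99)}}{4} = 2.487.$$

Hence

$$t = \frac{\bar{x}-\mu}{\sigma_s}\sqrt{(n-1)} = \frac{.25-0}{2.487}\sqrt{7} = .27 \text{ nearly.}$$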



2. The nine items of a sample had the following values :
45, 47, 50, 52, 48, 47, 49, 53, 51.

Does the mean of the nine items differ significantly from the assumed mean of 47.5, given that

P = .945 for t = 1.8
P = .953 for t = 1.9 ? (Agra M. Sc. 1953)





We find the mean and standard deviation of the sample as follows :

  x       d = x - A      $d^2$
  45         -3            9
  47         -1            1
  50          2            4
  52          4           16
  48          0            0
  47         -1            1
  49          1            1
  53          5           25
  51          3            9
Total        10           66

Remarks. A = assumed mean = 48.

$\bar{x}$ = mean = $48+\frac{10}{9}$ = 49.1.

$\mu_1' = \frac{1}{9}\Sigma d = \frac{10}{9}$, $\quad \mu_2' = \frac{1}{9}\Sigma d^2 = \frac{66}{9}$.

$\sigma_s^2 = \mu_2'-\mu_1'^2 = \frac{66}{9}-\frac{100}{81} = \frac{494}{81}$, i.e. $\sigma_s$ = 2.47.

Hence

$$t = \frac{\bar{x}-\mu}{\sigma_s}\sqrt{(n-1)} = \frac{49.1-47.5}{2.47}\sqrt{8} = 1.83.$$

Here $\nu$ = 9 - 1 = 8.

Now difference for .1 = .008.

$\therefore$ difference for .03 = $\frac{.008}{.1}\times .03$ = .0024.

Hence when t = 1.83, P = .945 + .0024 = .9474.

Here clearly P stands for $P_S$ so that we have

$P_F$ = 2 (1 - .9474) = .1052 > .05.

It follows that the value of t is not significant at 5% level of significance and therefore the test provides no evidence against the population mean being 47.5.








3. Ten individuals are chosen at random from a population and their heights are found to be in inches : 63, 63, 66, 67, 68, 69, 70, 70, 71, 71. In the light of these data, discuss the suggestion that the mean height in the population is 66 inches, having given that for t = 1.8, P = 0.947 and for t = 1.9, P = 0.955, where P is the area to the left of the ordinate at t.

(Agra B. Sc. 1955, M. Sc. 1954)

We find mean and s. d. of the sample as follows :

  x       f      d = x - A      fd      $d^2$     $fd^2$
  63      2         -4          -8       16        32
  66      1         -1          -1        1         1
  67      1          0           0        0         0
  68      1          1           1        1         1
  69      1          2           2        4         4
  70      2          3           6        9        18
  71      2          4           8       16        32
Total    10                      8                 88

Remarks. A = assumed mean = 67.

$\mu_1' = \frac{8}{10}$, so that $\bar{x}$ = mean = 67.8 inches.

$\mu_2' = \frac{88}{10}$.

$\therefore\ \sigma_s^2 = \mu_2'-\mu_1'^2 = \frac{88}{10}-\frac{64}{100} = \frac{204}{25}$, i.e. $\sigma_s = \frac{\sqrt{204}}{5}$ = 2.86 inches.

Hence

$$t = \frac{\bar{x}-\mu}{\sigma_s}\sqrt{(n-1)} = \frac{67.8-66}{2.86}\sqrt{9} = \frac{5.4}{2.86} = 1.89.$$

Now difference for .1 = .008.

$\therefore$ difference for .09 = $\frac{.008}{.1}\times .09$ = .0072.

Hence for t = 1.89, P = 0.947 + 0.0072 = 0.954.

$\therefore\ P_F$ = 2 (1 - 0.954) = 0.092 > .05.

It follows that the difference is not significant at 5% level of significance and the test provides no evidence against the population mean being 66 inches.

4. Ten individuals are chosen at random from a population, their heights are found to be 63, 63, 66, 67, 68, 69, 70, 70, 71 and 71 inches respectively. Test whether the mean height is 69.6 inches in the population, given that for 9 degrees of freedom

P{ |t| > 2.262 } = 0.05.

(Agra M. Sc. 1962)

As in Ex. 3 above, we get

$\bar{x}$ = 67.8 inches and $\sigma_s$ = 2.86 inches.

Hence

$$t = \frac{\bar{x}-\mu}{\sigma_s}\sqrt{(n-1)} = \frac{67.8-69.6}{2.86}\sqrt{9},$$

so that $|t| = \frac{5.4}{2.86}$ = 1.89.

Since this value of |t| is less than 2.262, the difference is not significant at 5% level of significance and the test provides no evidence against the population mean being 69.6 inches.

5. Show that the 95 per cent fiducial limits for the mean $\mu$ of the population are $\bar{x} \pm \frac{s}{\sqrt{n}}\,t_{0.05}$, where $t_{0.05}$ stands for the value of t at 5 per cent level of significance. Deduce that for a random sample of 16 values with mean 41.5 inches and the sum of the squares of the deviations from the mean 135 (inches)$^2$ and drawn from a normal population, 95 per cent fiducial limits for the mean of the population are 39.9 and 43.1 inches.

The probability that $|t| < t_{0.05}$ is .95. Hence the 95 per cent confidence limits for the mean $\mu$ of the population are given by

$$|t| = \frac{|\bar{x}-\mu|}{s}\sqrt{n} < t_{0.05}$$

or

$$|\bar{x}-\mu| < \frac{s}{\sqrt{n}}\,t_{0.05}$$

or

$$\bar{x}-\frac{s}{\sqrt{n}}\,t_{0.05} < \mu < \bar{x}+\frac{s}{\sqrt{n}}\,t_{0.05}.$$

We can therefore say with a confidence coefficient .95 that the confidence interval $\bar{x} \pm \frac{s}{\sqrt{n}}\,t_{0.05}$ contains the population mean $\mu$.

For the given numerical example, we have

$$s^2 = \frac{\Sigma (x-\bar{x})^2}{n-1} = \frac{135}{15} = 9,$$

so that s = 3.

Also from the tables, for 15 degrees of freedom, $t_{0.05}$ = 2.13.

$$\therefore\ \frac{s}{\sqrt{n}}\,t_{0.05} = \tfrac{3}{4}\times 2.13 = 1.6 \text{ approx.}$$

Hence the required fiducial limits are 41.5 $\pm$ 1.6, i.e. 39.9 and 43.1 inches.
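The arithmetic of this example can be reproduced with a short Python sketch. The function name and argument names are illustrative; the value of t must still be read from the tables for n - 1 degrees of freedom.

```python
import math


def fiducial_limits(mean, sum_sq_dev, n, t_value):
    """Limits mean -/+ (s / sqrt(n)) * t_value, with s^2 = sum_sq_dev / (n - 1)."""
    s = math.sqrt(sum_sq_dev / (n - 1))
    half_width = s / math.sqrt(n) * t_value
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    lower, upper = fiducial_limits(mean=41.5, sum_sq_dev=135, n=16, t_value=2.13)
    print(round(lower, 1), round(upper, 1))
```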


6. A machinist is making engine parts with axle diameters of 0.700 inch. A random sample of 10 parts shows a mean diameter of 0.742 inch with a standard deviation of 0.040 inch. On the basis of this sample, would you say that the work is inferior ?

Here, we have

$\mu$ = 0.700, $\bar{x}$ = 0.742, $\sigma_s$ = 0.040, n = 10.

$$\therefore\ t = \frac{\bar{x}-\mu}{\sigma_s}\sqrt{(n-1)} = \frac{0.742-0.700}{0.040}\sqrt{9} = \frac{0.126}{0.040} = 3.15.$$

From the tables, we get for 9 degrees of freedom,

$P_S$ = 0.995.

$\therefore\ P_F$ = 2 (1 - 0.995) = 0.01 < 0.05.

Hence the difference is significant at 5% level of significance. Hence the work is inferior. As a matter of fact, the work is inferior even if we make the level of significance only 2 per cent.

7. A certain type of thread has a mean breaking strength of 1.32 ounces. A change in its manufacture is recommended with a view to increasing its breaking strength, but the cost of the proposed change will be substantial. A sample of 50 pieces of the new thread is tested and found to have a mean strength of 1.39 ounces with a variance of 0.0424. Discuss whether the new sample fulfils expectations to the extent that the change should be made, first by using the normal curve and secondly by using t-distribution. To be quite sure of correct conclusion, a 1 per cent level of significance is to be used. You may make use of the following tables.

(a) The area A under the standard normal curve lying to the left of the ordinate at the standard normal variable z,

z     2.38      2.39      2.40      2.41      2.42
A     0.9913    0.9916    0.9918    0.9920    0.9922.

(b) At 1% level of significance, t = 2.70 for $\nu$ = 40 and t = 2.66 for $\nu$ = 60 where $\nu$ denotes the number of degrees of freedom.

Firstly, if we regard this as a large sample, we have

$$z = \frac{\bar{x}-\mu}{\sigma_s}\sqrt{n} = \frac{1.39-1.32}{\sqrt{(0.0424)}}\sqrt{(50)} = 2.40.$$

From table (a) above, we see that for z = 2.40, A = 0.9918, so that the probability that the sample value of z exceeds 2.40 in absolute value is 2 (1 - 0.9918) = 0.016, which differs very slightly from .01. It appears, then, that we have a border line case so far as the 1 per cent level of significance is concerned and that the result is hardly significant.

Again, by using the t-distribution, we find from table (b) that for 49 degrees of freedom,

t = 2.68,

and the value calculated from the sample is given by

$$t = \frac{\bar{x}-\mu}{\sigma_s}\sqrt{(n-1)} = \frac{1.39-1.32}{\sqrt{(0.0424)}}\sqrt{49} = 2.38,$$

which is less than 2.68. Hence the difference is not quite significant at 1 per cent level of significance.

This is an example of the occasional need for a careful study of the variable t or z, to be used, and the usefulness of a more complete table of t.


8. The mean weekly sale of the Yum-yum chocolate bar in a chain of candy stores was 146.3 bars per store. After an advertising campaign the mean weekly sales in 22 stores for a typical week increased to 153.7 and showed a standard deviation of 17.2. Is the evidence conclusive that the advertising was successful ? You are given that for 21 degrees of freedom, the value of t is 2.08 at 5% level of significance.

Here $\bar{x}$ = 153.7, $\mu$ = 146.3, n = 22, $\sigma_s$ = 17.2.

$$\therefore\ t = \frac{\bar{x}-\mu}{\sigma_s}\sqrt{(n-1)} = \frac{153.7-146.3}{17.2}\sqrt{21} = 1.97.$$

Since the calculated value of t is less than the given value, the difference is not significant at 5 per cent level of significance, i.e. the evidence that the advertising was successful is not conclusive so far as this level of significance is concerned.

9. An observer made the following observations on the vertical diameter of the planet Venus (in seconds of arc) :

42.70, 42.56, 43.01, 43.48, 42.76, 43.06, 43.63, 42.87, 41.60, 42.78, 42.95, 43.20, 43.18, 43.39, 43.10.

Assuming that the population of readings is normally distributed about a true value, which is estimated by the arithmetic mean, calculate 95 per cent confidence limits for the vertical diameter of Venus.




We first calculate mean and S. D. as follows :

   x         (x - 43)      $(x-43)^2$
 42.70        -0.30          0.0900
 42.56        -0.44          0.1936
 43.01         0.01          0.0001
 43.48         0.48          0.2304
 42.76        -0.24          0.0576
 43.06         0.06          0.0036
 43.63         0.63          0.3969
 42.87        -0.13          0.0169
 41.60        -1.40          1.9600
 42.78        -0.22          0.0484
 42.95        -0.05          0.0025
 43.20         0.20          0.0400
 43.18         0.18          0.0324
 43.39         0.39          0.1521
 43.10         0.10          0.0100
 Total        -0.73          3.2345

Assumed mean A = 43.

$$\bar{x} = 43-\frac{0.73}{15} = 43-0.0487 = 42.95 \text{ approx.}$$

and

$$\sigma_s^2 = \frac{3.2345}{15}-\left(\frac{0.73}{15}\right)^2 = 0.2156-0.0024 = 0.2132.$$

$\therefore\ \sigma_s$ = 0.46 approx.

Now the five per cent value of t for 14 degrees of freedom is 2.145, i.e. $t_{0.05}$ = 2.145.

Also $t = \frac{\bar{x}-\mu}{\sigma_s}\sqrt{(n-1)}$.

Hence the 95 per cent fiducial limits for $\mu$ are given by

$$\frac{|\bar{x}-\mu|}{\sigma_s}\sqrt{(n-1)} \le t_{0.05}$$

or $$\bar{x}-\frac{\sigma_s}{\sqrt{(n-1)}}\,t_{0.05} < \mu < \bar{x}+\frac{\sigma_s}{\sqrt{(n-1)}}\,t_{0.05}$$

or $$42.95-\frac{0.46}{\sqrt{(14)}}\times 2.145 < \mu < 42.95+\frac{0.46}{\sqrt{(14)}}\times 2.145$$

or 42.95 - 0.26 < $\mu$ < 42.95 + 0.26

or 42.69 < $\mu$ < 43.21.






Hence 95 per cent confidence limits for the vertical diameter of Venus are 42.69 and 43.21.
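The whole calculation for the Venus readings can be sketched in Python. The function name is illustrative; the limits are formed as $\bar{x} \pm \frac{\sigma_s}{\sqrt{n-1}}\,t_{0.05}$ with $\sigma_s$ computed with divisor n, and the tabulated value 2.145 for 14 degrees of freedom is supplied by hand. Since the code carries more decimal places than the hand calculation, the last digit of the limits may differ slightly.

```python
import math

READINGS = [42.70, 42.56, 43.01, 43.48, 42.76, 43.06, 43.63, 42.87,
            41.60, 42.78, 42.95, 43.20, 43.18, 43.39, 43.10]


def confidence_limits(values, t_value):
    """Limits mean -/+ sigma_s / sqrt(n - 1) * t_value for the population mean."""
    n = len(values)
    mean = sum(values) / n
    # Standard deviation of the sample values, with divisor n.
    sigma_s = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    half_width = sigma_s / math.sqrt(n - 1) * t_value
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    lower, upper = confidence_limits(READINGS, t_value=2.145)
    print(round(lower, 2), round(upper, 2))
```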

Exercises 

1. Ten individuals are chosen at random from a population and their heights are found to be in inches : 63, 63, 64, 65, 66, 69, 69, 70, 70, 71. Discuss the suggestion that the mean height in the universe is 65 inches, given that for 9 degrees of freedom the value of Student's t at 5 per cent level of significance is 2.262. (Agra M. Sc. '57)

[Ans. t = 2.02 < 2.262, so that it can be said that the mean height in the universe is 65 inches.]

2. Find Student's t for the following variate values in a sample of 10 :

-6, -4, -3, -2, -2, 0, 1, 1, 3, 5,

taking $\mu$ to be zero, and find from the tables the probability of getting a value of t as great or greater on random sampling from a normal population.

[Ans. t = -0.662, $\nu$ = 9, $P_S$ = 0.738, so that

$P_F$ = 2 (1 - 0.738) = 0.524.

$\therefore$ the probability that we should get a value of t greater in absolute value is 0.524.]

3. A certain stimulus administered to each of 12 patients resulted in the following increases of blood pressure :

5, 2, 8, -1, 3, 0, 6, -2, 1, 5, 0, 4.

Can it be concluded that the stimulus will be, in general, accompanied by an increase in blood pressure ?

(Delhi M. A. Eco. '57)

[Hint. $\bar{x}$ = 2.58, $\sigma_s$ = 2.97, n = 12.

Taking the hypothesis that the stimulus does not result in an increase in blood pressure, that is $\mu$ = 0, we get

$$t = \frac{\bar{x}-\mu}{\sigma_s}\sqrt{(n-1)} = \frac{2.58}{2.97}\sqrt{11} = 2.88 \text{ approx.}$$

The value of t for 11 degrees of freedom at 5 per cent level of significance is 2.201, which is less than the calculated value of t. Hence the hypothesis is refuted. It can therefore be said that the stimulus will in general be accompanied by an increase in blood pressure.]

4. A machine which produces mica insulating washers for use in electric devices is set to turn out washers having a thickness of 10 mils (1 mil = 0.001 inch). A sample of 10 washers has


THE SAMPLING OF VARIABLES 




an average thickness of 9.52 mils with a standard deviation of 0.60 mil. Find out the significance of such a deviation.

[Ans. t = 2.528, deviation is significant.]

5. A sample of 11 rats from a central population had an average blood viscosity of 3.92 with a s. d. of 0.61. On the basis of this sample, establish 95 per cent limits of $\mu$, the mean blood viscosity of central population.

[Ans. 3.51 and 4.33.]

6. If the mean I. Q. of 24 delinquent boys is 89.4 with a standard deviation of 11.0, find the 95 per cent confidence limits for the population mean. [Ans. $84.6 < \mu < 94.2$.]

7. The scores of 10 students in a test were as follows :

65, 70, 86, 83, 74, 90, 94, 57, 65, 76.

If the mean score of students in general in this test be 69 with standard deviation 9.3, would you consider this group different from the general run of students ?

[Hint. $\bar{x} = 76$, $\mu = 69$, $\sigma = 9.3$.

\[ \therefore\; t = \frac{\bar{x}-\mu}{\sigma}\sqrt{n} = \frac{76-69}{9.3}\sqrt{(10)} = 2.4. \]

For 9 degrees of freedom, we get, corresponding to this value of t, $P_S = 0.980$.

$P_F = 2(1 - 0.980) = 0.04 > 0.01$ and $< 0.05$.

Hence the difference is not significant at 1 per cent level of significance, though it is significant at 5 per cent level.]


8. The average breaking strength of steel rods is specified to be 18.5 thousand pounds. To test this a sample of 14 rods was tested and gave the following results (in units of 1,000 lbs.) :

15, 18, 16, 21, 19, 21, 17, 17, 15, 17, 20, 19, 17, 18.

Is the result of the experiment significant ?

[Ans. No; t = 1.23.]

9. The weights at birth of 15 babies born in a Calcutta hospital are given below. Each figure is correct to the nearest tenth of a pound.

7-9. S-S 6 7 ’ 7 *' 6 9 ’ T5 ’ 5 ‘ 7 ’ 4 ' 8, 6 ' 8, 7 ' 6 ’ 7 ' 8 - 81 - 5 °. 5 8, 

Find 99 per cent fiducial limits for the mean weight at birth of such babies, given that for 14 degrees of freedom t = 2.977 at 1 per cent level of significance.

[Ans. $5.931 < \mu < 7.723$, $\bar{x} = 6.827$ lbs., $s = 1.126$.]





10. A certain colliery is supposed to supply coal of ash content about 15. To test this, twenty random samples of the colliery’s coal are selected and tested. The null hypothesis is that the ash content is in fact 15. The results of the twenty tests gave an average ash content of 16.8 with a standard deviation of 3.6. Is this sufficient evidence on which to reject the hypothesis ? You are given that five per cent value of t for 19 degrees of freedom is 2.09.

[Ans. Yes; t = 2.2.]

11. The mean yield per plant for 11 tomato plants of a particular variety was found to be 1,284.73 gms. with a standard deviation of 96.41 gms. Set up 99 per cent fiducial limits for the mean yield of all plants of this variety.

[Ans. 1,192.61 and 1,376.85 gms.]

12. The following are 12 determinations (in degrees centigrade) of the melting point of a compound made by an analyst, the true melting point being 165°C.

Would you conclude from these data that his determinations are free from bias ?

164.4, 169.7, 163.9, 162.1, 160.9, 160.8, 161.4, 162.2, 168.5, 163.4, 162.9, 167.7. [Ans. Yes; t = 1.149.]

17.9. Test of significance of the differences between the sample means. Let there be two independent samples $x_1, x_2, x_3, \ldots, x_{n_1}$ and $y_1, y_2, y_3, \ldots, y_{n_2}$ with means $\bar{x}$ and $\bar{y}$ and standard deviations $\sigma_x$ and $\sigma_y$ from normal populations with the same variance. We have to test two hypotheses :

(i) The population means $\mu_1$ and $\mu_2$ are different,

(ii) The population means $\mu_1$ and $\mu_2$ are the same.

For (i) variate t is defined by the relation

\[ t = \frac{(\bar{x}-\bar{y}) - (\mu_1-\mu_2)}{s\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)^{1/2}} \qquad \ldots(1) \]

where
\[ s^2 = \frac{1}{n_1+n_2-2}\left(n_1\sigma_x^2 + n_2\sigma_y^2\right) = \frac{1}{n_1+n_2-2}\left[\Sigma(x-\bar{x})^2 + \Sigma(y-\bar{y})^2\right]. \]

It can be shown that the variate t defined by (1) has Fisher’s t-distribution with $n_1+n_2-2$ degrees of freedom since both $\bar{x}$ and $\bar{y}$ are calculated from the data.





And for (ii),
\[ t = \frac{\bar{x}-\bar{y}}{s\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)^{1/2}} \qquad \ldots(2) \]
having the same distribution with the same number of degrees of freedom.


Paired data. If the two samples are of the same size and the data are paired, then we define t by the relation

\[ t = \frac{\bar{d}-\mu}{s}\sqrt{n} \qquad \ldots(3) \]

where
\[ s^2 = \frac{1}{n-1}\Sigma(d_r-\bar{d})^2; \]

$d_r$ = differences of the rth members of the samples;

$\bar{d}$ = mean of the differences, i.e. $\bar{d} = \dfrac{1}{n}\Sigma d_r$

and $\mu$ = population mean of the differences.

If $\mu = 0$,
\[ t = \frac{\bar{d}}{s}\sqrt{n}. \qquad \ldots(4) \]

The number of degrees of freedom in this case $= n - 1$.
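The two statistics of this article may be sketched in a few lines of code. This is an illustrative sketch only: the function names `pooled_t` and `paired_t` are assumptions, the first implementing relation (2) for independent samples and the second the relation for paired data, with the population mean of the differences taken as `mu`.

```python
import math


def pooled_t(x, y):
    """t for two independent samples on the hypothesis mu_1 = mu_2.

    Returns (t, degrees of freedom) with
    s^2 = [sum (x - xbar)^2 + sum (y - ybar)^2] / (n1 + n2 - 2).
    """
    n1, n2 = len(x), len(y)
    x_bar, y_bar = sum(x) / n1, sum(y) / n2
    ss = sum((v - x_bar) ** 2 for v in x) + sum((v - y_bar) ** 2 for v in y)
    s2 = ss / (n1 + n2 - 2)
    t = (x_bar - y_bar) / math.sqrt(s2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def paired_t(x, y, mu=0.0):
    """t for paired data, d = y - x, on the hypothesis that the mean difference is mu."""
    d = [b - a for a, b in zip(x, y)]
    n = len(d)
    d_bar = sum(d) / n
    s2 = sum((v - d_bar) ** 2 for v in d) / (n - 1)
    return (d_bar - mu) / math.sqrt(s2) * math.sqrt(n), n - 1


# Illustrative data: marks of eleven boys in two tests (paired) ...
first = [23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19]
second = [24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 17]
t_paired, nu_paired = paired_t(first, second)

# ... and gains in weight of two independent groups of pigs.
diet_a = [10, 6, 16, 17, 13, 12, 8, 14, 15, 9]
diet_b = [7, 13, 22, 15, 12, 14, 18, 8, 21, 23, 10, 17]
t_pooled, nu_pooled = pooled_t(diet_a, diet_b)

print(round(t_paired, 3), nu_paired, round(t_pooled, 3), nu_pooled)
```

The value of t so obtained is then compared with the tabulated value for the stated number of degrees of freedom.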

17.10. Solved Examples.

1. Eleven school boys were given a test in drawing. They were given a month's further tuition and a second test of equal difficulty was held at the end of it. Do the marks give evidence that the students have benefited by the extra coaching ?

Boys     Marks, 1st test     Marks, 2nd test
  1            23                  24
  2            20                  19
  3            19                  22
  4            21                  18
  5            18                  20
  6            20                  22
  7            18                  20
  8            17                  20
  9            23                  23
 10            16                  20
 11            19                  17

We calculate the mean and the standard deviation of the differences between the marks of the two tests as follows :


Boys      x      y     d = y - x    d - d̄    (d - d̄)²
  1      23     24         1          0          0
  2      20     19        -1         -2          4
  3      19     22         3          2          4
  4      21     18        -3         -4         16
  5      18     20         2          1          1
  6      20     22         2          1          1
  7      18     20         2          1          1
  8      17     20         3          2          4
  9      23     23         0         -1          1
 10      16     20         4          3          9
 11      19     17        -2         -3          9

Total                   Σd = 11            Σ(d - d̄)² = 50


Mean of the differences $= \bar{d} = \dfrac{11}{11} = 1$,

\[ s^2 = \frac{1}{n-1}\Sigma(d-\bar{d})^2 = \frac{50}{10} = 5, \]

so that $s = 2.24$.

Now we take the hypothesis that the students have not benefited by the extra coaching. This implies that the mean of the differences between the marks of the two tests is zero, i.e. $\mu = 0$. Then

\[ t = \frac{\bar{d}-0}{s}\sqrt{n} = \frac{1}{2.24}\sqrt{(11)} = 1.482 \]

and $\nu = 11 - 1 = 10$.

The value of t for 10 degrees of freedom at 5 per cent level of significance is 2.228. The calculated value is less than this value and so the difference is not significant. That is, the test provides no evidence that the students have benefited by the extra coaching.

2. The sleep of 10 patients was measured for the effects of the soporific drugs referred to in the following table as Drug A and Drug B. From the data given below show that there is a significant difference between the effects of two drugs, on the assumption that different random samples of patients were used to test the two drugs A and B.






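Additional hours of sleep gained by use of soporific drugs

Patient     Drug A     Drug B
   1          0.7        1.9
   2         -1.6        0.8
   3         -0.2        1.1
   4         -1.2        0.1
   5         -0.1       -0.1
   6          3.4        4.4
   7          3.7        5.5
   8          0.8        1.6
   9          0          4.6
  10          2.0        3.6

(Delhi B. Com. 1959, Punjab M. A. 1954)

Calculations of the mean and S. D. of the differences :

Patient      x       y     d = y - x    d - d̄    (d - d̄)²
   1        0.7     1.9       1.2       -0.4       0.16
   2       -1.6     0.8       2.4        0.8       0.64
   3       -0.2     1.1       1.3       -0.3       0.09
   4       -1.2     0.1       1.3       -0.3       0.09
   5       -0.1    -0.1       0         -1.6       2.56
   6        3.4     4.4       1.0       -0.6       0.36
   7        3.7     5.5       1.8        0.2       0.04
   8        0.8     1.6       0.8       -0.8       0.64
   9        0       4.6       4.6        3.0       9.00
  10        2.0     3.6       1.6        0         0

Total                        16.0                 13.58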







Mean of the differences $\bar{d} = 1.6$ and

\[ s^2 = \frac{1}{n-1}\Sigma(d-\bar{d})^2 = \frac{13.58}{9}. \]

\[ \therefore\; t = \frac{\bar{d}-0}{s}\sqrt{n} = \frac{1.6-0}{\sqrt{(13.58)}} \times 3\sqrt{(10)} = \frac{4.8}{\sqrt{(1.358)}} = \frac{4.8}{1.165} = 4.12. \]

The value of t for 9 degrees of freedom at 5% level of significance is 2.262. The calculated value is considerably greater than this value. Hence the difference is highly significant. The drug B is therefore more effective than the drug A.

3. A farmer grows crops on two fields, A and B. On A he puts pound one worth of manure per acre and on B pound two worth. The net returns per acre, exclusive of cost of manure, on the two fields in five years are :

Year     Field A           Field B
         lbs. per acre     lbs. per acre
  1          17                18
  2          14                16.5
  3          21                24
  4          18.5              19
  5          22                25

Other things being equal, discuss the question whether it is likely to pay the farmer to continue the more expensive dressing. State clearly the assumptions which you make.


We should calculate the mean and standard deviation of the differences of the net returns including the cost of manure. Thus in this case the net returns in the first year for the fields A and B are 17 - 1 = 16 lbs. and 18 - 2 = 16 lbs. respectively and similarly for other years. Calculations are shown in the following table :





Year      x        y      d = y - x    d - d̄    (d - d̄)²
  1      16       16         0          -1         1
  2      13       14.5       1.5         0.5       0.25
  3      20       22         2           1         1
  4      17.5     17        -0.5        -1.5       2.25
  5      21       23         2           1         1

Total                        5                     5.5

∴ mean $\bar{d} = \dfrac{5}{5} = 1$ lb.

and
\[ s^2 = \frac{1}{n-1}\Sigma(d-\bar{d})^2 = \frac{1}{4} \times 5.5 = 1.375. \]


From our knowledge of crop yields, we expect them to be distributed in a single-humped form not very far removed from the normal. We further assume that the two manures have the same effect on yield. Then the differences in the returns of the two fields will be distributed in an approximately normal form about zero mean.

Hence
\[ t = \frac{\bar{d}-0}{s}\sqrt{n} = \frac{1}{\sqrt{(1.375)}}\sqrt{5} = 1.907. \]

Corresponding to this value of t, we find from the tables that

$P_S = 0.935$ ;

∴ $P_F = 2(1 - 0.935) = 0.13 > 0.05$.

Thus if we take our level of significance as 5%, the calculated value of t is not significant and in that case it is not worth while to continue the more expensive dressing.

This difference is, however, significant at 15% level of significance and in that case, we are led to suspect our hypothesis that the two manures exert equal influence on yield and we may suppose, though with very little confidence (with 87% confidence only), so far as these data are concerned, that more expensive dressing is better.
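The tabulated probability used above can be checked numerically. The sketch below is an assumption-laden illustration, not the method of the tables: it integrates the density of Student's t by Simpson's rule, the names `t_density` and `t_cdf` being hypothetical helpers, and evaluates the probability for t = 1.907 with 4 degrees of freedom.

```python
import math


def t_density(x, nu):
    """Density of Student's t with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)


def t_cdf(t, nu, steps=2000):
    """P(T <= t) by Simpson's rule on [0, |t|], using symmetry about zero.

    steps must be even.
    """
    a = abs(t)
    h = a / steps
    total = t_density(0.0, nu) + t_density(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, nu)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area


p_one = t_cdf(1.907, 4)    # probability of a value of t not exceeding 1.907
p_two = 2 * (1 - p_one)    # probability of a value greater in absolute value
print(round(p_one, 3), round(p_two, 3))
```

The one-tailed value agrees with the tabulated 0.935, and the doubled tail gives the probability of about 0.13 quoted in the solution.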

4. The following table shows the mean number of bacterial colonies per plate obtainable by four slightly different methods from soil samples taken at 4 P. M. and 8 P. M. respectively :

            Method A    Method B    Method C    Method D
4 P. M.      29.75       27.50       30.25       27.80
8 P. M.      39.20       40.60       36.20       42.40

Are there significantly more bacteria at 8 P. M. than at 4 P. M. ?

(Delhi B. A. Hons. 1959)

Calculation of the mean and standard deviation of the differences between the number of bacterial colonies per plate at 4 P. M. and 8 P. M. :

Method    4 P. M.    8 P. M.    d = y - x    d - d̄     (d - d̄)²
             x          y
  A        29.75      39.20        9.45      -1.325      1.756
  B        27.50      40.60       13.10       2.325      5.406
  C        30.25      36.20        5.95      -4.825     23.281
  D        27.80      42.40       14.60       3.825     14.631

Total                             43.10                 45.074

Mean $\bar{d} = \dfrac{43.1}{4} = 10.775$,

\[ s^2 = \frac{45.074}{4-1} = 15.0247. \]

We take our hypothesis that there is no difference between the bacteria at 4 P. M. and 8 P. M., i.e. $\mu = 0$.

Hence
\[ t = \frac{\bar{d}-0}{s}\sqrt{n} = \frac{10.775}{\sqrt{(15.0247)}}\sqrt{4} = \frac{21.550}{\sqrt{(15.0247)}} = 5.56. \]

Corresponding to this value of t we have from the tables for three degrees of freedom

$P_S = 0.994$,

$P_F = 2(1 - 0.994) = 0.012 < 0.05$.

Hence the difference is significant and the hypothesis is refuted. It follows that there are significantly more bacteria at 8 P. M. than at 4 P. M.

5. The yields of two types ‘Type 17’ and ‘Type 51’ of grains in pounds per acre in 6 replications are given below. What comment would you make on the differences in the mean yields ? You may assume that if there be 5 degrees of freedom and P = 0.2, t is approximately 1.476.

Replication     Yield in pounds     Yield in pounds
                    Type 17             Type 51
     1               20.50               24.86
     2               24.60               26.39
     3               23.06               28.19
     4               29.98               30.75
     5               30.37               29.97
     6               23.83               22.04

(I. A. S. 1951)

Calculation of the mean and S. D. of the differences in the yields of the two types of grains :

Replication      x        y      d = y - x    d - d̄     (d - d̄)²
     1         20.50    24.86       4.36       2.717      7.38
     2         24.60    26.39       1.79       0.147      0.02
     3         23.06    28.19       5.13       3.487     12.15
     4         29.98    30.75       0.77      -0.873      0.76
     5         30.37    29.97      -0.40      -2.043      4.17
     6         23.83    22.04      -1.79      -3.433     11.79

Total                               9.86                 36.27


Mean of the differences of the yields $= \bar{d} = \dfrac{9.86}{6} = 1.643$ lbs.

\[ s^2 = \frac{\Sigma(d-\bar{d})^2}{5} = \frac{36.27}{5}, \]
so that $s = 2.69$ lbs.

We adopt the null hypothesis that the difference in types has no effect on yields, i.e. $\mu = 0$.

\[ \therefore\; t = \frac{\bar{d}-0}{s}\sqrt{n} = \frac{1.643}{2.69}\sqrt{6} = 1.49 \text{ approx.} \]

Now for the given value of t = 1.476, we have P = 0.2 for 5 degrees of freedom. The calculated value is greater than this value, so that the difference is significant at 20% level of significance although it is not significant at the conventional 5% level of significance.

6. Ten pairs of maize plants were grown in parallel boxes, and one member of each pair was “treated” by receiving a small electric current. The differences in height between the treated and untreated in mm. were :

(treated) - (untreated) : 6.0, 1.3, 10.2, 23.9, 3.1, 6.8, -1.5, -14.7, -3.3 and 11.1.

Test whether the small electric current affected the growth of maize seedlings.

We set up the null hypothesis that the electric current does not affect the growth, so that the observed differences are random observations from a population with zero mean. By actual calculations, we find

\[ \bar{x} = \frac{42.9}{10} = 4.29, \qquad s^2 = \frac{1}{n-1}\Sigma(x-\bar{x})^2 = 104.13 \]

[this is left as an exercise for the students.]

\[ \therefore\; t = \frac{\bar{x}-0}{s}\sqrt{(10)} = \frac{4.29}{\sqrt{(104.13)}}\sqrt{(10)} = \frac{4.29\sqrt{(10)}}{10.2} = 1.33. \]

For 9 degrees of freedom, the value of t at 5 per cent level of significance is 2.262, which is greater than the calculated value. Hence the hypothesis is not rejected. In other words, the sample does not provide satisfactory evidence that the electric treatment has made any difference to the growth of maize seedlings.
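The calculation left as an exercise can be verified as follows. This is only a sketch, using the ten differences quoted in the question and the standard library's `statistics.variance`, which divides by n - 1 as the definition of $s^2$ requires.

```python
import math
import statistics

# Differences (treated) - (untreated) in mm., as given in the question.
diffs = [6.0, 1.3, 10.2, 23.9, 3.1, 6.8, -1.5, -14.7, -3.3, 11.1]

n = len(diffs)
mean = statistics.mean(diffs)          # sample mean
s2 = statistics.variance(diffs)        # sum of squared deviations / (n - 1)
t = (mean - 0) / math.sqrt(s2) * math.sqrt(n)

print(round(mean, 2), round(s2, 2), round(t, 2))
```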

7. The additional hours of sleep gained by using two drugs by the same group of 12 patients are given below :

Patient     Additional hours of sleep
             Drug 1      Drug 2
   1           2.1         3.6
   2           0.2         4.7
   3           0.9         1.8
   4           3.8         5.5
   5           3.5         4.6
   6          -0.2        -0.3
   7          -1.3        -0.4
   8          -0.3         1.9
   9          -1.7         2.0
  10           0.7         1.7
  11          -0.8         2.1
  12           1.3         1.1

Test whether the second drug gives, on the average, at least an hour more of sleep than the first drug.

Calculation of the mean and S. D. of the differences is shown 
below :— 


Patient      x       y     d = y - x    d - d̄    (d - d̄)²
   1        2.1     3.6       1.5       -0.1       0.01
   2        0.2     4.7       3.5        1.9       3.61
   3        0.9     1.8       0.9       -0.7       0.49
   4        3.8     5.5       1.7        0.1       0.01
   5        3.5     4.6       1.1       -0.5       0.25
   6       -0.2    -0.3      -0.1       -1.7       2.89
   7       -1.3    -0.4       0.9       -0.7       0.49
   8       -0.3     1.9       2.2        0.6       0.36
   9       -1.7     2.0       3.7        2.1       4.41
  10        0.7     1.7       1.0       -0.6       0.36
  11       -0.8     2.1       2.9        1.3       1.69
  12        1.3     1.1      -0.2       -1.8       3.24

Total                        19.1                 17.81

∴ $\bar{d} = \dfrac{19.1}{12} = 1.6$ and $\Sigma(d-\bar{d})^2 = 17.81$.


Now we set up the hypothesis that the second drug gives, on the average, an hour more of sleep than the first drug, i.e. $\mu = 1$, where $\mu$ is the population mean of the differences.

\[ \therefore\; t = \frac{\bar{d}-\mu}{\sqrt{\Sigma(d-\bar{d})^2}}\sqrt{\{n(n-1)\}} = \frac{1.6-1}{\sqrt{(17.81)}}\sqrt{(12 \times 11)} = \frac{0.6 \times 11.5}{4.2} = 1.64. \]

For 11 degrees of freedom, we get from tables $t_{0.05} = 1.796$. Since the calculated value of t is less than this value, the difference is not significant at 5 per cent level of significance.

Hence there is no reason to suspect the hypothesis that the second drug gives, on average, an hour more of sleep than the first drug.


8. For a random sample of 10 pigs fed on diet A, the increases in weight in pounds in a certain period were :

10, 6, 16, 17, 13, 12, 8, 14, 15, 9 lbs.

For another random sample of 12 pigs, fed on diet B, the increases in the same period were

7, 13, 22, 15, 12, 14, 18, 8, 21, 23, 10, 17 lbs.

Test whether diets A and B differ significantly as regards the effect on increases in weight (or test whether the mean increases in the two samples are significantly different). You may use the fact that 5 per cent value of t for 20 degrees of freedom is 2.09.

(Bombay M. A. ’58, Delhi B. A. Hons. ’51, ’57, I. A. S. ’50)

We first calculate the means and S. D.’s for the two samples as follows :


            Diet A                              Diet B
   x      x - x̄    (x - x̄)²          y      y - ȳ    (y - ȳ)²
  10       -2         4              7       -8        64
   6       -6        36             13       -2         4
  16        4        16             22        7        49
  17        5        25             15        0         0
  13        1         1             12       -3         9
  12        0         0             14       -1         1
   8       -4        16             18        3         9
  14        2         4              8       -7        49
  15        3         9             21        6        36
   9       -3         9             23        8        64
                                    10       -5        25
                                    17        2         4

 120        0       120            180        0       314


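\[ \bar{x} = \frac{120}{10} = 12 \text{ lbs.}, \qquad \bar{y} = \frac{180}{12} = 15 \text{ lbs.} \]

\[ s^2 = \frac{1}{n_1+n_2-2}\left[\Sigma(x-\bar{x})^2 + \Sigma(y-\bar{y})^2\right] = \frac{1}{20}[120+314] = \frac{434}{20} \]

or $s^2 = 21.7$, ∴ $s = \sqrt{(21.7)} = 4.65$ lbs.

Hence
\[ t = \frac{\bar{x}-\bar{y}}{s}\sqrt{\left(\frac{n_1 n_2}{n_1+n_2}\right)} = \frac{12-15}{4.65}\sqrt{\left(\frac{12 \times 10}{12+10}\right)} = -1.5 \text{ approx.} \]

$\nu$ = number of degrees of freedom $= n_1 + n_2 - 2 = 20$.

For 20 degrees of freedom, we are given that $t_{0.05} = 2.09$. The calculated value is numerically less than $t_{0.05}$ and so the difference between the sample means is not significant. Hence the diets A and B do not differ significantly as regards the effect on increases in weight.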


9. Twenty students were given the same test ; they were divided into two groups : A and B. Group A had a training in logic, Group B had not. Their marks were :

Group A : 20, 13, 5, 12, 16, 17, 8, 19, 16, 14.

Group B : 15, 17, 5, 18, 15, 8, 14, 12, 11, 5.

Find, by Fisher’s method, the probability of finding the observed difference between the means by chance in two samples of 10 from a homogeneous population. Is the difference significant ?

The calculation of mean and S. D. of the two groups is shown below :


            Group A                             Group B
   x      x - x̄    (x - x̄)²          y      y - ȳ    (y - ȳ)²
  20        6        36             15        3         9
  13       -1         1             17        5        25
   5       -9        81              5       -7        49
  12       -2         4             18        6        36
  16        2         4             15        3         9
  17        3         9              8       -4        16
   8       -6        36             14        2         4
  19        5        25             12        0         0
  16        2         4             11       -1         1
  14        0         0              5       -7        49

 140        0       200            120        0       198


We set up the null hypothesis that there is no difference between the population means, i.e. $\mu_1 = \mu_2$.

Then
\[ t = \frac{(\bar{x}-\bar{y}) - 0}{s}\sqrt{\left(\frac{n_1 n_2}{n_1+n_2}\right)}. \]

Here $n_1 = n_2 = 10$ and $\nu = n_1 + n_2 - 2 = 18$,

\[ s^2 = \frac{1}{n_1+n_2-2}\left\{\Sigma(x-\bar{x})^2 + \Sigma(y-\bar{y})^2\right\} = \frac{1}{18}\{200+198\} = \frac{398}{18} = 22.11. \]

\[ \therefore\; t = \frac{14-12}{\sqrt{(22.11)}}\sqrt{\left(\frac{10 \times 10}{10+10}\right)} = \frac{2 \times \sqrt{5}}{\sqrt{(22.11)}} = \frac{4.47}{4.70} = 0.95. \]

From the tables, we find for $\nu = 18$,

when t = 0.9, P = 0.810,
when t = 1.0, P = 0.835.

Hence when t = 0.95, P = 0.8225.

Thus the chance of getting a value of t greater than the observed value is 1 - 0.8225 = 0.1775, so that the probability of getting t greater in absolute value is 0.355 or about one in three. Hence the difference is not significant.
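The value of P for t = 0.95 was obtained above by proportional parts between the two neighbouring entries of the table. A minimal sketch of that linear interpolation follows; the function name `interpolate` is an assumption introduced for illustration.

```python
def interpolate(t, t_lo, p_lo, t_hi, p_hi):
    """Linear interpolation of a tabulated probability between two entries."""
    return p_lo + (t - t_lo) * (p_hi - p_lo) / (t_hi - t_lo)


# Tabulated entries for 18 degrees of freedom quoted in the solution.
p = interpolate(0.95, 0.9, 0.810, 1.0, 0.835)
p_two_sided = 2 * (1 - p)   # chance of a value of t greater in absolute value

print(round(p, 4), round(p_two_sided, 3))
```

Linear interpolation is adequate here because the tabulated probabilities change slowly over so short an interval of t.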

10. Explain the uses of ‘t’ distribution.

For a random sample of 10 pigs, fed on a diet A, the increases in weight in a certain period were 10, 6, 16, 17, 13, 12, 8, 14, 15, 9 lbs.

For another random sample of 12 pigs, fed on diet B, the increases in the same period were 7, 13, 22, 15, 12, 14, 18, 8, 21, 23, 10, 17 lbs.

Find if the two samples are significantly different regarding the effect of diet, given that for (d.f.) $\nu = 20, 22, 24$, the five per cent values of t are respectively 2.09, 2.07, 2.06.

(Agra 1963; Raj. ’59; Bombay M. A. ’58; I. A. S. ’50; Delhi B. A. Hons. ’51, ’57)

For uses of t-distribution see § 17.6.




Calculation of mean and S. D. of the two series is shown 
below : 



            Diet A                              Diet B
   x      x - x̄    (x - x̄)²          y      y - ȳ    (y - ȳ)²
  10       -2         4              7       -8        64
   6       -6        36             13       -2         4
  16        4        16             22        7        49
  17        5        25             15        0         0
  13        1         1             12       -3         9
  12        0         0             14       -1         1
   8       -4        16             18        3         9
  14        2         4              8       -7        49
  15        3         9             21        6        36
   9       -3         9             23        8        64
                                    10       -5        25
                                    17        2         4

 120                120            180                314



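$\bar{x} = \dfrac{120}{10} = 12$, $n_1 = 10$, $\Sigma(x-\bar{x})^2 = 120$ ;

$\bar{y} = \dfrac{180}{12} = 15$, $n_2 = 12$, $\Sigma(y-\bar{y})^2 = 314$ ;

$\nu = n_1 + n_2 - 2 = 10 + 12 - 2 = 20$.

We assume that the samples do not differ in weight so far as the two diets are concerned, i.e. $\mu_1 - \mu_2 = 0$.

Hence
\[ t = \frac{(\bar{y}-\bar{x}) - 0}{s}\sqrt{\left(\frac{n_1 n_2}{n_1+n_2}\right)}. \]

Now
\[ s^2 = \frac{1}{n_1+n_2-2}\left[\Sigma(x-\bar{x})^2 + \Sigma(y-\bar{y})^2\right] = \frac{1}{20}[120+314] = \frac{434}{20} = 21.7 \]

so that
\[ t = \frac{15-12}{\sqrt{(21.7)}}\sqrt{\left(\frac{120}{22}\right)} = \frac{3 \times \sqrt{(5.4545)}}{\sqrt{(21.7)}} = \frac{3 \times 2.335}{4.658} = \frac{7.005}{4.658} = 1.5. \]

For $\nu = 20$, the five per cent value of t is 2.09. Since the calculated value of t is less than this value, the two diets do not differ significantly as regards the increase in weight.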


11. Two horses A and B were tested according to the time (in seconds) to run a particular track with the following results :

Horse A : 28, 30, 32, 33, 33, 29 and 34.

Horse B : 29, 30, 30, 24, 27 and 29.

Test whether you can discriminate between two horses. You can use the fact that 5 per cent value of t for 11 degrees of freedom is 2.20. (Agra 1962, I. A. S. 1953)

Calculation of the mean and S. D. of the two samples is given below :

            Horse A                             Horse B
   x     (x - 32)   (x - 32)²         y     (y - 29)   (y - 29)²
  28       -4        16             29        0         0
  30       -2         4             30        1         1
  32        0         0             30        1         1
  33        1         1             24       -5        25
  33        1         1             27       -2         4
  29       -3         9             29        0         0
  34        2         4

 219       -5        35            169       -5        31




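$\bar{x} = \dfrac{219}{7} = 31.3$, $n_1 = 7$ ;

$\bar{y} = \dfrac{169}{6} = 28.2$, $n_2 = 6$ ;

$\nu = n_1 + n_2 - 2 = 11$.

\[ \sigma_x^2 = \frac{1}{n_1}\Sigma(x-\bar{x})^2 = \frac{35}{7} - \left(\frac{-5}{7}\right)^2 \]
or
\[ n_1\sigma_x^2 = 35 - \frac{25}{7} = 31.43 \]

and
\[ \sigma_y^2 = \frac{1}{n_2}\Sigma(y-\bar{y})^2 = \frac{31}{6} - \left(\frac{-5}{6}\right)^2 \]
or
\[ n_2\sigma_y^2 = 31 - \frac{25}{6} = 26.83. \]

On the null hypothesis that the horses do not differ, we get

\[ t = \frac{\bar{x}-\bar{y}}{s}\sqrt{\left(\frac{n_1 n_2}{n_1+n_2}\right)} \]

where
\[ s^2 = \frac{n_1\sigma_x^2 + n_2\sigma_y^2}{n_1+n_2-2} = \frac{31.43+26.83}{11} = 5.3 \text{ approx.} \]

i.e.
\[ t = \frac{31.3-28.2}{\sqrt{(5.3)}}\sqrt{\left(\frac{42}{13}\right)} = \frac{3.1 \times 1.80}{2.30} = 2.4 \text{ approx.} \]

The five per cent value of t for 11 degrees of freedom is given to be 2.20. The calculated value of t is greater than this value and so the difference is significant. Hence we can discriminate between the two horses, though only with 95 per cent degree of confidence.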
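The table above measures deviations from the convenient working values 32 and 29 instead of from the actual means. The sum of squares about the mean is then recovered from the identity $\Sigma(x-\bar{x})^2 = \Sigma(x-A)^2 - \{\Sigma(x-A)\}^2/n$, where A is the working value. The sketch below checks this on the two samples; the function names are assumptions introduced for illustration.

```python
def ss_about_mean(values, assumed):
    """Sum of squares about the mean, computed from deviations about an assumed value."""
    n = len(values)
    dev = [v - assumed for v in values]
    return sum(d * d for d in dev) - sum(dev) ** 2 / n


def ss_direct(values):
    """Sum of squares about the mean, computed directly."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


horse_a = [28, 30, 32, 33, 33, 29, 34]
horse_b = [29, 30, 30, 24, 27, 29]

ss_a = ss_about_mean(horse_a, 32)   # 35 - 25/7
ss_b = ss_about_mean(horse_b, 29)   # 31 - 25/6
print(round(ss_a, 2), round(ss_b, 2))
```

The result does not depend on the working value chosen, so any convenient round number near the mean may be used.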

12. The following data represent the yields in bushels of Indian corn on ten sub-divisions of equal areas of two agricultural plots in which plot I was a control plot, treated the same as plot II, except for the amount of phosphorus applied as a fertilizer :

Plot I     Plot II
  6.2        5.6
  5.7        5.9
  6.5        5.6
  6.0        5.7
  6.3        5.8
  5.8        5.7
  5.7        6.0
  6.0        5.5
  6.0        5.7
  5.8        5.5

Is there a significant difference between the yields on the two plots, using the difference between their means as a criterion of judgement ?

[Given from Student's table for n = 18, P = 0.0072 for the probability that t will fall outside the range -3.034 and +3.034.]

(Agra 1960)

Calculation of the mean and S. D. is exhibited below :



            Plot I                              Plot II
   x      x - x̄    (x - x̄)²          y      y - ȳ    (y - ȳ)²
  6.2      0.2      0.04            5.6     -0.1      0.01
  5.7     -0.3      0.09            5.9      0.2      0.04
  6.5      0.5      0.25            5.6     -0.1      0.01
  6.0      0        0               5.7      0        0
  6.3      0.3      0.09            5.8      0.1      0.01
  5.8     -0.2      0.04            5.7      0        0
  5.7     -0.3      0.09            6.0      0.3      0.09
  6.0      0        0               5.5     -0.2      0.04
  6.0      0        0               5.7      0        0
  5.8     -0.2      0.04            5.5     -0.2      0.04

 60.0               0.64           57.0               0.24


$\bar{x} = \dfrac{60}{10} = 6.0$, $n_1 = 10$, $\Sigma(x-\bar{x})^2 = 0.64$ ;

$\bar{y} = \dfrac{57}{10} = 5.7$, $n_2 = 10$, $\Sigma(y-\bar{y})^2 = 0.24$.

We take the null hypothesis that there is no difference between the yields of the two plots, i.e. the samples are from the same universe so that $\mu_1 - \mu_2 = 0$.

\[ \therefore\; t = \frac{(\bar{x}-\bar{y}) - 0}{s}\sqrt{\left(\frac{n_1 n_2}{n_1+n_2}\right)} = \frac{6.0-5.7}{\left(\dfrac{0.64+0.24}{18}\right)^{1/2}}\left(\frac{100}{20}\right)^{1/2} \]

\[ = 0.3 \times \sqrt{(102.27)} = 0.3 \times 10.113 = 3.034. \]

Now the probability that t will be outside the range -3.034 and 3.034 is given to be 0.0072 which is less than both 0.01 and 0.05. Hence the null hypothesis that the samples come from the same population is refuted by the test for both the 5% and 1% levels of significance. In other words, our conclusion is that, on the levels of significance adopted, there is a significant difference between the yields on the two plots.

13. The heights of six randomly chosen sailors are, in inches : 63, 65, 68, 69, 71 and 72. Those of ten randomly chosen soldiers are : 61, 62, 65, 66, 69, 69, 70, 71, 72 and 73. Discuss the light that these data throw on the suggestion that soldiers are on the average taller than sailors, given that for $\nu = 14$,

P = 0.539 for t = 0.10,
P = 0.527 for t = 0.08. (Agra 1956)

Calculation of the means and S. D. of the two samples is given below.






            Sailors                             Soldiers
   x      x - x̄    (x - x̄)²          y      y - ȳ    (y - ȳ)²
  63       -5        25             61      -6.8      46.24
  65       -3         9             62      -5.8      33.64
  68        0         0             65      -2.8       7.84
  69        1         1             66      -1.8       3.24
  71        3         9             69       1.2       1.44
  72        4        16             69       1.2       1.44
                                    70       2.2       4.84
                                    71       3.2      10.24
                                    72       4.2      17.64
                                    73       5.2      27.04

 408                 60            678               153.60


$\bar{x} = \dfrac{408}{6} = 68$ inches, $n_1 = 6$, $\Sigma(x-\bar{x})^2 = 60$ ;

$\bar{y} = \dfrac{678}{10} = 67.8$ inches, $n_2 = 10$, $\Sigma(y-\bar{y})^2 = 153.60$.

We set up the null hypothesis that there is no difference in the mean height of the sailors and soldiers so that $\mu_1 - \mu_2 = 0$.

\[ \therefore\; t = \frac{(68-67.8) - 0}{\left(\dfrac{60+153.60}{14}\right)^{1/2}}\sqrt{\left(\frac{6 \times 10}{6+10}\right)} = \frac{0.2}{\sqrt{(15.26)}}\sqrt{(3.75)} = \frac{0.2 \times 1.94}{3.9} = 0.099. \]

d. f. $= \nu = 6 + 10 - 2 = 14$.

Now for 14 degrees of freedom, we are given

for t = 0.10, P = 0.539,

for t = 0.08, P = 0.527,

then for t = 0.099, P = 0.538.

Hence $P_F = 2[1 - 0.538] = 0.924$.

Since this value of $P_F$ is much greater than both 0.05 and 0.01, the difference is not significant at either the 5% or the 1% level of significance. Hence these data do not support the suggestion that soldiers are on the average taller than sailors.

14. Eight pots growing three wheat plants each were exposed to a high-tension discharge while nine similar pots were enclosed in an earthenware cage. The number of tillers in each pot were as follows :

Caged          17  26  18  25  27  28  26  23  17

Electrified    16  16  22  16  21  18  15  20

Discuss whether electrification exercises any real effect on tillering. (Delhi I. C. A. R. 1956)

We calculate the mean and S. D. of the two samples.

            Caged                               Electrified
   x      x - x̄    (x - x̄)²          y      y - ȳ    (y - ȳ)²
  17       -6        36             16       -2         4
  26        3         9             16       -2         4
  18       -5        25             22        4        16
  25        2         4             16       -2         4
  27        4        16             21        3         9
  28        5        25             18        0         0
  26        3         9             15       -3         9
  23        0         0             20        2         4
  17       -6        36

 207                160            144                 50

∴ $\bar{x} = \dfrac{207}{9} = 23$, $n_1 = 9$, $\Sigma(x-\bar{x})^2 = 160$ ;

$\bar{y} = \dfrac{144}{8} = 18$, $n_2 = 8$, $\Sigma(y-\bar{y})^2 = 50$.

$\nu = n_1 + n_2 - 2 = 15$.

\[ s^2 = \frac{1}{n_1+n_2-2}\left[\Sigma(x-\bar{x})^2 + \Sigma(y-\bar{y})^2\right] = \frac{1}{15}[160+50] = 14. \]

We assume that electrification exercises no real effect on tillering, i.e. $\mu_1 - \mu_2 = 0$.

Hence
\[ t = \frac{(\bar{x}-\bar{y}) - 0}{s}\sqrt{\left(\frac{n_1 n_2}{n_1+n_2}\right)} = \frac{23-18}{\sqrt{14}}\sqrt{\frac{72}{17}} = \frac{30}{\sqrt{(119)}} = 2.751. \]

The value of t for 15 degrees of freedom at 5% level of significance is 2.131 which is less than the calculated value of t. Hence the calculated value of t is significant and therefore electrification does exert some effect on the tillering.

15. The densities of sulphuric acid in two containers were measured, four determinations being made on one container and six on the other. Do the results lead to the rejection of the hypothesis that the acids have the same density ? You are given that five per cent value of t for 8 degrees of freedom is 2.31.

To calculate t, the following short-cut method is useful. Subtract 1.840 from each figure and multiply the remainder by 1,000. This will not affect the value of t.



            Container A                         Container B
   x      x - x̄    (x - x̄)²          y      y - ȳ    (y - ȳ)²
   2      -1.5      2.25             8        2         4
   6       2.5      6.25             3       -3         9
   3      -0.5      0.25             6        0         0
   3      -0.5      0.25             7        1         1
                                     7        1         1
                                     5       -1         1

  14                9               36                 16

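∴ $\bar{x} = \dfrac{14}{4} = 3.5$, $n_1 = 4$, $\Sigma(x-\bar{x})^2 = 9$ ;

$\bar{y} = \dfrac{36}{6} = 6.0$, $n_2 = 6$, $\Sigma(y-\bar{y})^2 = 16$.

$\nu = n_1 + n_2 - 2 = 8$.

\[ \therefore\; t = \frac{\bar{x}-\bar{y}}{s}\sqrt{\left(\frac{n_1 n_2}{n_1+n_2}\right)} = \frac{3.5-6}{\left(\dfrac{9+16}{8}\right)^{1/2}}\sqrt{\frac{24}{10}} \]

or
\[ |t| = (2.5)\sqrt{\frac{8}{25}}\sqrt{(2.4)} = 2.5 \times 0.8 \times \sqrt{(1.2)} = 2.19. \]

Since the calculated value of t (i.e. 2.19) is less than the given value of t for 5 per cent level of significance (i.e. 2.31), the null hypothesis that there is no difference in the densities of the two acids is not contradicted at this level.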
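The statement that the short-cut method does not affect the value of t can be checked numerically. In the sketch below the coded figures are those of the table; the densities are recovered from them by reversing the stated coding (density = 1.840 + coded figure / 1,000), which is an assumption made for illustration, and `pooled_t` is a hypothetical helper implementing the two-sample statistic of § 17.9.

```python
import math


def pooled_t(x, y):
    """Two-sample t on the hypothesis of equal population means (pooled variance)."""
    n1, n2 = len(x), len(y)
    x_bar, y_bar = sum(x) / n1, sum(y) / n2
    ss = sum((v - x_bar) ** 2 for v in x) + sum((v - y_bar) ** 2 for v in y)
    s = math.sqrt(ss / (n1 + n2 - 2))
    return (x_bar - y_bar) / s * math.sqrt(n1 * n2 / (n1 + n2))


coded_a = [2, 6, 3, 3]
coded_b = [8, 3, 6, 7, 7, 5]

# Reverse the coding: density = 1.840 + coded / 1000 (assumed decoding).
raw_a = [1.840 + c / 1000 for c in coded_a]
raw_b = [1.840 + c / 1000 for c in coded_b]

t_coded = pooled_t(coded_a, coded_b)
t_raw = pooled_t(raw_a, raw_b)
print(round(t_coded, 3), round(t_raw, 3))
```

A change of origin cancels in the difference of the means and in the deviations, while a change of scale multiplies numerator and denominator by the same factor, so t is unchanged.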




16. Twelve boys were fed on diet A and 15 on diet B. The gains in weight for the individual boys (in pounds) were as shown :

A : 25, 32, 30, 34, 24, 25, 14, 32, 24, 30, 31, 35.

B : 44, 34, 22, 10, 47, 31, 40, 30, 32, 35, 18, 21, 35, 29, 22.

Find whether diet B is superior to diet A, given that the five per cent value of t for 25 degrees of freedom is 2.06.

Calculation of mean and s. d. is shown below :


            Diet A                              Diet B
   x      x - x̄    (x - x̄)²          y      y - ȳ    (y - ȳ)²
  25       -3         9             44       14       196
  32        4        16             34        4        16
  30        2         4             22       -8        64
  34        6        36             10      -20       400
  24       -4        16             47       17       289
  25       -3         9             31        1         1
  14      -14       196             40       10       100
  32        4        16             30        0         0
  24       -4        16             32        2         4
  30        2         4             35        5        25
  31        3         9             18      -12       144
  35        7        49             21       -9        81
                                    35        5        25
                                    29       -1         1
                                    22       -8        64

 336                380            450               1410



$\bar{x} = \dfrac{336}{12} = 28$, $n_1 = 12$, $\Sigma(x-\bar{x})^2 = 380$ ;

$\bar{y} = \dfrac{450}{15} = 30$, $n_2 = 15$, $\Sigma(y-\bar{y})^2 = 1410$.

$\nu = n_1 + n_2 - 2 = 12 + 15 - 2 = 25$.

On the null hypothesis that the diets do not differ so far as the increase in weight is concerned, we have

\[ t = \frac{\bar{x}-\bar{y}}{s}\sqrt{\left(\frac{n_1 n_2}{n_1+n_2}\right)} = \frac{28-30}{\left(\dfrac{380+1410}{25}\right)^{1/2}}\sqrt{\left(\frac{180}{27}\right)}, \]

\[ |t| = 2 \times 5\sqrt{\left(\frac{180}{27 \times 1790}\right)} = 10\sqrt{\left(\frac{2}{537}\right)} = \frac{10}{16.38} = 0.61 \text{ nearly.} \]

Now the given value of t for 25 d. f. at 5 per cent level of significance is 2.06. The calculated value of t is considerably less than this value and as such the hypothesis is to be accepted, i.e. the difference in the mean gain in weight on the two diets is not significant. Hence no diet can be said to be superior to the other.

17. The ash content of coal from two different mines was analysed, five analyses being made of the coal from the first mine, four of that from the second mine. Are we justified in supposing that the two mines consist of coal with the same percentage of ash content on the basis of the results obtained, which are recorded in the following tables :

Table A                     Table B
Per cent ash content        Per cent ash content
      24.3                        18.2
      20.8                        16.9
      23.7                        20.2
      21.3                        16.7
      17.4

Calculation of the mean and S. D. of the two series is shown below :

            First mine                          Second mine
   x      x - x̄    (x - x̄)²          y      y - ȳ    (y - ȳ)²
  24.3     2.8      7.84            18.2     0.2      0.04
  20.8    -0.7      0.49            16.9    -1.1      1.21
  23.7     2.2      4.84            20.2     2.2      4.84
  21.3    -0.2      0.04            16.7    -1.3      1.69
  17.4    -4.1     16.81

 107.5     0.0     30.02            72.0     0.0      7.78

Thus 

107*5 

'”5” = 

21*5, n,= 

5, Z(x- 

S)*=30*02 



00 

1 

PH 

11 

■0, n z =4, 

2{y-y) z = 7-78, 



v= n 1 -\-n 2 — 2=4+5—2=7. 

On the null hypothesis that the ash contents in coals of two 
mines have the same percentage, we have 

z-y / ( n x n 2 \ 
s V \n Y + nJ 



THE SAMPLING OF VARIABLES 


427 


21 * 5 — 18-0 / 20 

^ 30-02 + 7 78 y V 9 ~ 


3*5 

V(2*43) 


= 2‘245 approx. 


The five per cent value of t for 7 degrees of freedom is
2.365. Since the calculated value of t is less than this value,
the difference in the ash contents of the coal of the two mines is not
significant. Hence the null hypothesis is not contradicted, i.e. the
data are consistent with the two mines having the same percentage of
ash content.
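The arithmetic of Example 17 can be checked with a short sketch. It assumes only the pooled formula used above, $t = \frac{\bar{x}-\bar{y}}{s}\sqrt{n_1 n_2/(n_1+n_2)}$ with $s^2 = [\Sigma(x-\bar{x})^2+\Sigma(y-\bar{y})^2]/(n_1+n_2-2)$; the function name `pooled_t` is introduced here purely for illustration and is not from the text.

```python
from math import sqrt


def pooled_t(x, y):
    """Student's t for two independent samples, using the pooled estimate s."""
    n1, n2 = len(x), len(y)
    mean_x = sum(x) / n1
    mean_y = sum(y) / n2
    # Sums of squared deviations from each sample mean.
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    dof = n1 + n2 - 2
    s = sqrt((ss_x + ss_y) / dof)
    t = (mean_x - mean_y) / s * sqrt(n1 * n2 / (n1 + n2))
    return t, dof


first_mine = [24.3, 20.8, 23.7, 21.3, 17.4]
second_mine = [18.2, 16.9, 20.2, 16.7]
t, dof = pooled_t(first_mine, second_mine)
print(f"t = {t:.3f} on {dof} degrees of freedom")
```

The printed value may then be compared with the tabulated 5 per cent point for the same degrees of freedom, exactly as in the text.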

18. Two chemists estimate the strength of a dilute solution
of hydrochloric acid. Their results are an average of 10.162 with a
standard deviation of 0.23 based on 15 determinations and 10.341
with standard deviation of 0.12 based on 24 determinations. What
is the level of significance of the difference between the chemists'
estimates of the acidity on the hypothesis that their results are
random variations about the same population mean ?

Here $\bar{x} = 10.162$, $n_1 = 15$, $\sigma_x = 0.23$;
$\bar{y} = 10.341$, $n_2 = 24$, $\sigma_y = 0.12$;
$\nu = n_1 + n_2 - 2 = 15 + 24 - 2 = 37$.

$$t = \frac{\bar{x}-\bar{y}}{s}\sqrt{\frac{n_1 n_2}{n_1+n_2}}, \quad\text{where}\quad s^2 = \frac{1}{n_1+n_2-2}\left[n_1\sigma_x^2 + n_2\sigma_y^2\right],$$

or

$$t = \frac{10.162-10.341}{\left\{\dfrac{15\times(0.23)^2+24\times(0.12)^2}{37}\right\}^{1/2}}\sqrt{\frac{15\times 24}{39}},$$

or

$$|t| = \frac{0.179}{\sqrt{0.0308}}\sqrt{9.23} = 3.10 \text{ approx.}$$

Now we find from the tables that the 5% value of t for 37
degrees of freedom is 2.027.

Since the calculated value of t is greater than this value,
the difference between the two estimates of acidity is
significant at the 5 per cent level of significance.

19. Two tinctures of strophanthus were tested by the cat
method, each tincture being administered to 7 cats. The mean
lethal dose in cubic centimeters of undiluted tincture per kilogram
of cat was 0.0168 for tincture A, and 0.0199 for tincture B. The
respective standard deviations were 0.00328 and 0.00309. Do the
tinctures appear to have significantly different effects ?

Here $\bar{x} = 0.0168$, $n_1 = 7$, $\sigma_x = 0.00328$;
$\bar{y} = 0.0199$, $n_2 = 7$, $\sigma_y = 0.00309$;
$\nu = n_1 + n_2 - 2 = 12$.

We set up the hypothesis that the two tinctures have the same
effects. Then

$$t = \frac{\bar{x}-\bar{y}}{s}\sqrt{\frac{n_1 n_2}{n_1+n_2}} = \frac{0.0168-0.0199}{\left\{\dfrac{7\times(0.00328)^2+7\times(0.00309)^2}{12}\right\}^{1/2}}\sqrt{\frac{7\times 7}{14}},$$

or

$$|t| = \frac{0.0031\sqrt{3.5}}{\sqrt{0.00001185}} = \frac{0.0031\times 1.871}{0.00344} = 1.69 \text{ approx.}$$

From the tables we find that for 12 degrees of freedom,
$t_{0.05} = 2.179$.

Since the calculated value of t is less than this value,
the difference in the effects of the two tinctures is not significant.

20. Two independent samples of 8 and 7 items respectively had
the following values :

Sample 1 :   9   11   13   11   15    9   12   14

Sample 2 :  10   12   10   14    9    8   10

Is the difference between the means of the samples significant ?
Given that for $\nu = 13$, $P = 0.874$ for $t = 1.2$ and $P = 0.892$ for
$t = 1.3$. (Agra, 1958)

Calculation of mean and S. D. is given below :

          Sample 1                        Sample 2
     x     (x - 12)   (x - 12)²       y     (y - 10)   (y - 10)²
     9       -3          9           10        0          0
    11       -1          1           12        2          4
    13        1          1           10        0          0
    11       -1          1           14        4         16
    15        3          9            9       -1          1
     9       -3          9            8       -2          4
    12        0          0           10        0          0
    14        2          4
   ----     ----       ----         ----     ----       ----
    94       -2         34           73        3         25

We then have
$$n_1 = 8,\quad \bar{x} = \frac{94}{8} = 11.75,\quad 8\sigma_x^2 = 34 - \frac{(-2)^2}{8} = 33.5;$$
$$n_2 = 7,\quad \bar{y} = \frac{73}{7} = 10.43,\quad 7\sigma_y^2 = 25 - \frac{3^2}{7} = 23.7.$$

$$t = \frac{11.75-10.43}{\left\{\dfrac{33.5+23.7}{13}\right\}^{1/2}}\sqrt{\frac{8\times 7}{15}} = \frac{1.32}{\sqrt{4.4}}\sqrt{3.73} = 1.2144.$$

Now for 13 degrees of freedom, we have

for t = 1.2, P = 0.874;
for t = 1.3, P = 0.892.

Hence for t = 1.2144, P = 0.8765, and

$$P_F = 2(1-P) = 2(1-0.8765) = 0.2470 > 0.05.$$

Since the value of $P_F$ is considerably greater than 0.05, the
difference between the means of the two samples is not significant.
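The interpolation step in Example 20 is plain linear interpolation between the two tabulated points, followed by doubling the upper-tail area. A minimal sketch follows; the helper name `interpolate` is an illustrative assumption, and the tabulated pairs are those quoted in the solution.

```python
def interpolate(t, t_lo, p_lo, t_hi, p_hi):
    """Linear interpolation of P between two tabulated values of t."""
    return p_lo + (t - t_lo) * (p_hi - p_lo) / (t_hi - t_lo)


# Tabulated values for 13 degrees of freedom, as quoted in the solution.
p = interpolate(1.2144, 1.2, 0.874, 1.3, 0.892)

# Both tails are used because the sign of the difference is not specified
# in advance.
p_two_tail = 2 * (1 - p)

print(f"P = {p:.4f}, two-tail probability = {p_two_tail:.4f}")
print("significant at 5 per cent" if p_two_tail < 0.05 else "not significant at 5 per cent")
```

The result agrees with the figures in the text to within the rounding of the four-figure working.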

21. In three samples of 50 lines each from Shakespeare's
"Romeo and Juliet" (an early play), the following numbers of weak
endings were observed : 7, 9, 10. In three similar samples from
"Cymbeline" (late), the numbers of weak endings were 15, 11, 12.
Discuss the suggestion that Shakespeare's prosody, as judged by the
number of weak endings, changed with advancing years.

Here $\Sigma x = 7 + 9 + 10 = 26$, $n_1 = 3$, so that $\bar{x} = \frac{26}{3} = 8.666$,
and $\Sigma x^2 = 49 + 81 + 100 = 230$.

$$\therefore\quad 3\sigma_x^2 = \Sigma(x-\bar{x})^2 = 230 - \frac{676}{3} = 230 - 225.333 = 4.667.$$

Similarly $\Sigma y = 15 + 11 + 12 = 38$, $n_2 = 3$, $\bar{y} = \frac{38}{3} = 12.666$,
and $\Sigma y^2 = 225 + 121 + 144 = 490$,

or
$$3\sigma_y^2 = \Sigma(y-\bar{y})^2 = 490 - \frac{1444}{3} = 490 - 481.333 = 8.667.$$

Hence
$$t = \frac{\bar{x}-\bar{y}}{\left\{\dfrac{\Sigma(x-\bar{x})^2+\Sigma(y-\bar{y})^2}{n_1+n_2-2}\right\}^{1/2}}\sqrt{\frac{n_1 n_2}{n_1+n_2}} = \frac{8.666-12.666}{\left\{\dfrac{4.667+8.667}{4}\right\}^{1/2}}\sqrt{\frac{9}{6}},$$

or
$$|t| = \frac{8}{\sqrt{13.334}}\times 1.2247 = \frac{9.7976}{3.651} = 2.683.$$

Number of d.f. = 3 + 3 - 2 = 4.

From the tables, we get for $\nu = 4$, $t_{0.05} = 2.776$.

Since the observed value of t is nearly equal to $t_{0.05}$, the
difference of the means is likely to be significant, which supports
the suggestion.
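When only the totals $\Sigma x$, $\Sigma x^2$ and the sample sizes are at hand, as in Example 21, the same statistic follows from the identity $\Sigma(x-\bar{x})^2 = \Sigma x^2 - (\Sigma x)^2/n$. The sketch below assumes that identity; the function name `t_from_sums` is hypothetical.

```python
from math import sqrt


def t_from_sums(sum_x, sum_x2, n1, sum_y, sum_y2, n2):
    """Pooled t computed from the sums and sums of squares of two samples."""
    # Sum of squared deviations about the mean: sum(x^2) - (sum x)^2 / n.
    ss_x = sum_x2 - sum_x ** 2 / n1
    ss_y = sum_y2 - sum_y ** 2 / n2
    dof = n1 + n2 - 2
    s = sqrt((ss_x + ss_y) / dof)
    t = (sum_x / n1 - sum_y / n2) / s * sqrt(n1 * n2 / (n1 + n2))
    return t, dof


romeo = [7, 9, 10]
cymbeline = [15, 11, 12]
t, dof = t_from_sums(
    sum(romeo), sum(v * v for v in romeo), len(romeo),
    sum(cymbeline), sum(v * v for v in cymbeline), len(cymbeline),
)
print(f"|t| = {abs(t):.3f} on {dof} degrees of freedom")
```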


Exercises. 


1. Calculate the value of 't' in the case of two characteristics
A and B whose corresponding frequencies are

A :  16   10    8    9    9    8

B :   8    4    5    9   12    4

(Agra 1951)
[Ans. t = 1.8 nearly]

2. In a certain experiment to compare two types of pig-food
A and B, the following results of increase in weights were
observed in pigs :

Increase in weight in lbs.

Pig number :    1    2    3    4    5    6    7    8    Total
Food A     :   49   53   51   52   47   50   52   53     407
Food B     :   52   55   52   53   50   54   54   53     423

Assuming that the two samples of pigs are independent,
can we conclude that food B is better than food A ? Examine
the case when the same set of eight pigs were used in both the
foods. (Bombay, M.A. 1952, Delhi B.A. Hons. 1955)

[Ans. Food B is better]

3. Mitchell conducted a paired feeding experiment with pigs on
the relative value of limestone and bone-meal for bone development.
The results are :

Ash content in percentages of scapulas of pairs fed on
limestone and bone-meal.

Pair    Limestone    Bone-meal
 1        49.2         51.5
 2        53.3         54.9
 3        50.6         52.2
 4        52.0         53.3
 5        46.8         51.6
 6        50.5         54.1
 7        52.1         54.2
 8        53.0         54.3

Determine the significance of the difference between the
means in two ways : (1) by assuming that the values are paired
and (2) by assuming that the values are not paired.

(Delhi I. C. A. R. 1956)
[Ans. The difference is significant; t = 4.444]

4. The marks obtained by 20 students of college A and 15
students of college B in a mathematics test are given below :

College A : 89, 76, 63, 69, 55, 71, 84, 87, 88, 52, 47, 81, 32,
43, 86, 29, 49, 73, 80, 44.

College B : 79, 61, 36, 42, 50, 12, 55, 81, 35, 73, 22, 90, 76,
67, 62.

Do you think students of college A are more proficient
in mathematics than students of college B ? Given that for
$\nu$ (d. f.) = 33, $t_{0.05} = 2.036$. [Ans. No; t = 1.275]

5. In the manufacture of steel by the open-hearth process, 63
casts charged with refined iron showed an average charge
to tap time of 11.35 hours with a variance of 1.07. Then
the same number of casts charged with basic iron showed
a corresponding mean and variance of 10.63 and 0.75.
Does the quality of iron used in charging affect the charge
time ?

6. The number of minutes required for a group of 15 pupils
to complete a test were as follows :

12, 14, 15, 16, 16, 18, 19, 19, 19, 20, 21, 24, 27, 29, 31.

A second group of 20 required the following numbers of
minutes for the same test :

10, 12, 13, 14, 14, 17, 17, 18, 20, 21, 21, 21, 21, 22, 22, 23,
24, 24, 25, 30.

Do the groups vary materially with respect to time
required ?

7. The means of two random samples of sizes 9 and 7 respectively
are 196.42 and 198.82 respectively. The sums of the
squares of the deviations from the means are 26.94 and 18.73
respectively. Can the samples be considered to have been
drawn from the same normal population, it being given
that the value of t for 14 d. f. at 5 per cent level of significance
is 2.145 and at 1 per cent level of significance is 2.977 ?

(Lucknow B. A. and B. Sc. 1940) [Ans. No; t = 3.175]

8. How can the 't' test be applied for testing the significance of the
difference between two sample means ? Calculate the value
of t in the case of two characters A and B whose corresponding
values are

A : 41, 49, 34, 36, 46, 50, 36, 20, 18.

B : 46, 44, 30, 35, 26, 28, 29. [Ans. t = 0.577]

9. In a test examination given to two groups of students, the 
marks obtained were as follows : 

First Group : 18, 20, 36, 50, 49, 36, 34, 49, 41. 

Second Group : 29, 28, 26, 35, 30, 44, 46. 

Examine the significance of difference between the arithmetic
averages of the marks secured by the students of the
above two groups. (P. C. S. 1951)

10. The following figures show the mean breaking strength
index of two types of cotton fibre. Taking the figures in pairs,
determine whether the mean difference is significant.

Index of Breaking Strengths

Fibre A :  7.2   7.7   7.8   7.6   7.3   7.8   7.7   7.6   7.4
Fibre B :  7.2   7.2   7.1   7.3   7.3   7.3   7.6   7.2   7.9

[Ans. Not significant; t = 0.79]

11. The yield of wax extracted from peat depends on the nature
of the solvent used to extract the wax. Two solvents were
tested on five samples of peat. The table shows yield of
crude wax (per cent dry peat). Is the effect significant ?

Sample    :    1      2      3      4      5
Solvent A :   2.3   10.4    2.4    6.8    4.1
Solvent B :   3.2   11.8    2.7    8.6    5.2

[Ans. t = 4.4, significant on 5 per cent and not on 1 per
cent level of significance]

17.11. To test the significance of the ratio of two independent
estimates of the population variance : the z-test.

Two independent random samples of sizes $n_1$ and $n_2$ are
drawn from two normal populations having the standard deviations
$\sigma_1$ and $\sigma_2$. We have to test the hypothesis that $\sigma_1 = \sigma_2$.

Assuming the null hypothesis $\sigma_1 = \sigma_2$, Fisher showed that the
statistic

$$z = \tfrac{1}{2}\log_e\frac{s_1^2}{s_2^2} \qquad \ldots(1)$$

or

$$z = \log_e\frac{s_1}{s_2} = \log_{10}\frac{s_1}{s_2}\times\log_e 10 = 2.3026\,\log_{10}\frac{s_1}{s_2} \qquad \ldots(2)$$

is distributed according to the law

$$y = y_0\,\frac{e^{\nu_1 z}}{\left(\nu_1 e^{2z}+\nu_2\right)^{(\nu_1+\nu_2)/2}}, \qquad \ldots(3)$$

where $s_1^2$ and $s_2^2$ are the estimated variances, i.e.

$$s_1^2 = \frac{1}{n_1-1}\Sigma(x-\bar{x})^2 \quad\text{and}\quad s_2^2 = \frac{1}{n_2-1}\Sigma(y-\bar{y})^2,$$

and $\nu_1$ and $\nu_2$ are the numbers of degrees of freedom, i.e. $\nu_1 = n_1 - 1$
and $\nu_2 = n_2 - 1$.

Here $s_1^2$ is the larger variance. The larger variance is always
divided by the smaller, so that z is always positive.

Note. If one of the population standard deviations is known
or is specified by hypothesis, this means that the corresponding
number of degrees of freedom is infinite.
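As a small illustration of definition (1), the sketch below computes z from the two sums of squared deviations, always placing the larger estimated variance in the numerator. The function name `fisher_z` and the figures passed to it are hypothetical and are chosen only to make the arithmetic easy to follow.

```python
from math import log


def fisher_z(ss1, n1, ss2, n2):
    """Return (z, nu_numerator, nu_denominator) for two samples.

    ss1 and ss2 are the sums of squared deviations from the sample means.
    """
    var1 = ss1 / (n1 - 1)  # estimated variance of the first sample
    var2 = ss2 / (n2 - 1)  # estimated variance of the second sample
    if var1 >= var2:
        return 0.5 * log(var1 / var2), n1 - 1, n2 - 1
    # Swap so that the larger variance is divided by the smaller.
    return 0.5 * log(var2 / var1), n2 - 1, n1 - 1


# Hypothetical figures: estimated variances 40/8 = 5 and 70/7 = 10.
z, nu1, nu2 = fisher_z(40.0, 9, 70.0, 8)
print(f"z = {z:.4f} with nu1 = {nu1}, nu2 = {nu2}")
```

Because the samples are swapped when necessary, the degrees of freedom returned first always belong to the larger variance, which is the order in which the z-tables are entered.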


17.12. Fisher's z-Tables of points and the significance test.

We take $y_0$ so that the total area under the curve given by
(3) is unity. The probability that we get a given value $z_0$ or
greater on random sampling will be given by the area to the
right of the ordinate at $z_0$. Tables for this probability for various
values of z are not available, because the probability is difficult
to evaluate and depends upon two numbers $\nu_1$ and $\nu_2$.

Fisher has prepared tables showing 5 per cent and 1 per cent
points of significance for z. Colcord and Deming have prepared
a table of 0.1 per cent points of significance. Generally, these
tables are sufficient to enable us to gauge the significance of an
observed value of z.


It should be noted that the z-tables give only critical values
corresponding to right-tail areas. Thus 5 per cent points of z
imply that the area to the right of the ordinate at the variable z is
0.05. A similar remark applies to 1 per cent points of z. In
other words, when both tails are used, 5 per cent and 1 per cent
points of z correspond to 10 per cent and 2 per cent levels of
significance, respectively.

17.13. F-Distribution. We define F by the relation

$$F = \frac{s_1^2}{s_2^2}, \qquad \ldots(1)$$

where $s_1^2$ and $s_2^2$ are unbiased estimates of population variances
arising from two samples assumed to have been drawn independently
from two normal populations having equal variances,
i.e. $\sigma_1 = \sigma_2$. The larger of the two variances is placed in the
numerator. The distribution of F is given by

$$y = y_0\,F^{(\nu_1-2)/2}\left(1+\frac{\nu_1}{\nu_2}F\right)^{-(\nu_1+\nu_2)/2}, \qquad \ldots(2)$$

where $y_0$, calculated from the condition that the total area under
the curve is unity, is given by

$$y_0 = \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2}\frac{\Gamma\left(\dfrac{\nu_1+\nu_2}{2}\right)}{\Gamma\left(\dfrac{\nu_1}{2}\right)\Gamma\left(\dfrac{\nu_2}{2}\right)},$$

and $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$ denote the numbers of degrees of
freedom.

The curve of F is skew with a range from 0 to $\infty$. The F-distribution
is merely a transformation of the original Fisher's
z-distribution by means of the relation

$$z = \tfrac{1}{2}\log_e F.$$
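The normalising condition on $y_0$ can be checked numerically. The sketch below evaluates the ordinate of the F curve with the standard constant $y_0 = (\nu_1/\nu_2)^{\nu_1/2}\,\Gamma(\frac{\nu_1+\nu_2}{2})/[\Gamma(\frac{\nu_1}{2})\Gamma(\frac{\nu_2}{2})]$ and sums trapezoids to confirm that the area is close to unity. The function names, the upper limit of integration and the step are illustrative assumptions; the shortcut of returning zero at F = 0 is valid only for $\nu_1 > 2$.

```python
from math import exp, lgamma, log


def f_density(f, nu1, nu2):
    """Ordinate y of the F curve with (nu1, nu2) degrees of freedom."""
    if f <= 0.0:
        return 0.0  # correct at f = 0 only when nu1 > 2 (assumed here)
    # log of y0, computed with log-gamma to avoid overflow.
    log_y0 = (lgamma((nu1 + nu2) / 2) - lgamma(nu1 / 2) - lgamma(nu2 / 2)
              + (nu1 / 2) * log(nu1 / nu2))
    return exp(log_y0 + ((nu1 - 2) / 2) * log(f)
               - ((nu1 + nu2) / 2) * log(1 + nu1 * f / nu2))


def area_under_curve(nu1, nu2, upper=200.0, step=0.002):
    """Trapezoidal estimate of the area under the curve from 0 to `upper`."""
    n = int(round(upper / step))
    total = 0.0
    prev = f_density(0.0, nu1, nu2)
    for i in range(1, n + 1):
        cur = f_density(i * step, nu1, nu2)
        total += 0.5 * (prev + cur) * step
        prev = cur
    return total


def z_from_f(f):
    """Fisher's z corresponding to a variance ratio F."""
    return 0.5 * log(f)


total_area = area_under_curve(8, 7)
print(f"area under the F curve for (8, 7) d.f. = {total_area:.4f}")
```

The area to the right of any ordinate could be estimated in the same way, which is in effect how a table of percentage points is built up.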

17.14. Snedecor's F-tables and significance test.

These tables, like z-tables, give 5 per cent and 1 per cent
points of significance for F. As with z-tables, 5 per cent points of
F mean that the area under the curve to the right of
the ordinate at a value of F is 0.05. Thus these tables also give
only a one-tail test. In most of the problems arising in the analysis
of variance (see the next chapter), we are interested only in
significantly large values of F, for which a one-tail test is appropriate.
Hence we may obtain the desired criteria directly from the table.
On the other hand, if we are testing the hypothesis that the
population variances are the same, we must use both tail-areas
under the F-curve. In such cases these tables provide 10 per
cent and 2 per cent levels of significance. If, however, a 5 per
cent or a 1 per cent level of significance is needed, a rough
approximation for these may be obtained from these tables by
interpolation.

Note. It is useful to note that the F-test should be applied first
to see whether the two population variances are the same. If
this test gives a favourable verdict about the hypothesis $\sigma_1 = \sigma_2$,
then the t-test should be applied to test the significance of the
difference between two population means, since this test depends
on the equality of the population variances.

17.15. Solved Examples.

1. Show how you would use Student's t-test and Fisher's
z-test to decide whether the two sets of observations

17, 27, 18, 25, 27, 29, 27, 23, 17
and 16, 16, 20, 16, 20, 17, 15, 21

indicate samples from the same universe. (Agra M. Sc. '49)


We calculate the mean and S. D. of the two series as follows :

        1st observation                  2nd observation
     x     (x - 23)   (x - 23)²       y     (y - 16)   (y - 16)²
    17       -6         36           16        0          0
    27        4         16           16        0          0
    18       -5         25           20        4         16
    25        2          4           16        0          0
    27        4         16           20        4         16
    29        6         36           17        1          1
    27        4         16           15       -1          1
    23        0          0           21        5         25
    17       -6         36
   ----     ----       ----         ----     ----       ----
   210        3        185          141       13         59

$$\bar{x} = \frac{210}{9} = 23.333,\qquad \bar{y} = \frac{141}{8} = 17.625.$$

$$s_1^2 = \frac{\Sigma(x-23)^2 - \{\Sigma(x-23)\}^2/n_1}{n_1-1} = \frac{185}{8} - \frac{9}{9\times 8} = \frac{184}{8} = 23.$$

Similarly,

$$s_2^2 = \frac{59}{7} - \frac{13\times 13}{8\times 7} = \frac{303}{56} = 5.4107.$$

We first use the z-test to see whether the population variances are
the same.

$$z = \tfrac{1}{2}\log_e\frac{s_1^2}{s_2^2} = \tfrac{1}{2}\log_e\frac{23}{5.4107} = 0.724, \text{ using log tables.}$$


Now the 5 per cent value of z for $\nu_1 = 8$ and $\nu_2 = 7$ is 0.6576,
and the 1 per cent value of z for these degrees of freedom is
0.9614.

Since the calculated value of z lies between these two values,
the variance ratio is significant for 5 per cent points and not
significant for 1 per cent points. Hence so far as the 1 per cent
point level is concerned, the two population variances are the
same.

We now apply the t-test to gauge the significance of the difference
between the population means. On the hypothesis $\mu_1 = \mu_2$,
we have

$$t = \frac{\bar{x}-\bar{y}}{s}\sqrt{\frac{n_1 n_2}{n_1+n_2}},$$

where

$$s^2 = \frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2} = \frac{8\times 23+7\times 5.4107}{9+8-2} = \frac{221.8749}{15} = 14.7916,$$

so that $s = 3.846$. Hence

$$t = \frac{5.708}{3.846}\sqrt{\frac{72}{17}} = 3.05 \text{ approx.},\qquad \nu = n_1+n_2-2 = 15.$$

For 15 degrees of freedom, we have

$$t_{0.05} = 2.131 \quad\text{and}\quad t_{0.01} = 2.947.$$

Since the calculated value of t is greater than both these
values, the difference between the population means is significant,
i.e. the two samples do not belong to the same population.
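The two stages of Example 1 (variance ratio first, then the difference of means) can be reproduced as below. This is only a sketch of the arithmetic; the variable names are illustrative, and the critical values must still be read from the z- and t-tables.

```python
from math import log, sqrt

x = [17, 27, 18, 25, 27, 29, 27, 23, 17]
y = [16, 16, 20, 16, 20, 17, 15, 21]


def sum_sq_dev(values):
    """Sum of squared deviations from the sample mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


n1, n2 = len(x), len(y)
ss_x, ss_y = sum_sq_dev(x), sum_sq_dev(y)

# Stage 1: Fisher's z, with the larger estimated variance in the numerator.
var_x = ss_x / (n1 - 1)
var_y = ss_y / (n2 - 1)
z = 0.5 * log(max(var_x, var_y) / min(var_x, var_y))

# Stage 2: Student's t with the pooled estimate of the standard deviation.
s = sqrt((ss_x + ss_y) / (n1 + n2 - 2))
t = (sum(x) / n1 - sum(y) / n2) / s * sqrt(n1 * n2 / (n1 + n2))

print(f"s1^2 = {var_x:.4f}, s2^2 = {var_y:.4f}, z = {z:.4f}")
print(f"s = {s:.3f}, t = {t:.3f} on {n1 + n2 - 2} degrees of freedom")
```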





2. Two samples of sizes 9 and 8 give the sums of squares of
deviations from their respective means equal to 160 square inches and
91 square inches respectively. Can they be regarded as drawn from
the same normal population ?

We have

$$\Sigma(x-\bar{x})^2 = 160 \quad\text{and}\quad \Sigma(y-\bar{y})^2 = 91,$$
$$s_1^2 = \frac{160}{8} = 20 \quad\text{and}\quad s_2^2 = \frac{91}{7} = 13.$$

Hence $F = \frac{20}{13} = 1.54$ approx.

Now for $\nu_1 = 8$ and $\nu_2 = 7$, we have

$$F_{0.05} = 3.73.$$

Since the calculated value of F is considerably less than this
value, the population variances are not significantly different. It
follows that the two samples can be regarded as drawn from two normal
populations with the same variance. If the two populations are
to be the same, their means should also be the same, since the mean
and standard deviation completely specify a normal population.
Whether the population means are the same can be seen by applying
the t-test, provided we know the sample means.

3. In a sampling experiment $s_1 = 3.6$, $s_2 = 2.0$ and $\nu_1 = 5$, $\nu_2 = 10$.
Is the difference between $s_1$ and $s_2$ significant at the 5 per cent level ?

Here
$$F = \frac{s_1^2}{s_2^2} = (1.80)^2 = 3.24.$$

Now the 5 per cent value of F for $\nu_1 = 5$ and $\nu_2 = 10$ is 3.33.

Since the calculated value is less than this value, the difference
between $s_1$ and $s_2$ is not significant at the 5 per cent level.
The z-test can also be applied.


4. Two gauge operators are tested for precision in making
measurements. One operator completes a set of 26 readings with a
standard deviation of 1.34 and the other does 34 readings with a
standard deviation of 0.98. What is the level of significance of this
difference ? You are given that for $\nu_1 = 25$ and $\nu_2 = 33$, $z_{0.05} = 0.306$
and $z_{0.01} = 0.432$.

Here $\sigma_x = 1.34$, $n_1 = 26$,

$$\therefore\quad s_1 = \sqrt{\frac{n_1}{n_1-1}}\,\sigma_x = \sqrt{\frac{26}{25}}\times 1.34,$$

and $\sigma_y = 0.98$, $n_2 = 34$,

$$\therefore\quad s_2 = \sqrt{\frac{n_2}{n_2-1}}\,\sigma_y = \sqrt{\frac{34}{33}}\times 0.98.$$

Hence

$$z = \log_e\frac{s_1}{s_2} = 2.3026\,\log_{10}\frac{s_1}{s_2} = 2.3026\,\log_{10}\left\{\sqrt{\frac{26}{25}\cdot\frac{33}{34}}\times\frac{1.34}{0.98}\right\}$$

$$= 2.3026\left[\tfrac{1}{2}\{\log_{10}13+\log_{10}33-\log_{10}25-\log_{10}17\}+\log_{10}1.34-\log_{10}0.98\right]$$

$$= 1.1513\,\{1.11394+1.51851-1.39794-1.23045\} + 2.3026\,\{0.12710-(-0.00877)\}$$

$$= 1.1513\times 0.00406 + 2.3026\times 0.13587$$

$$= 1.1513\,[0.00406+0.27174]$$

$$= 1.1513\times 0.27580 = 0.3175.$$

Since the calculated value of z is just greater than $z_{0.05}$ and
less than $z_{0.01}$, the difference between the standard deviations is just
significant at 5 per cent level and not significant at 1 per cent
level.

The F-test can also be applied.
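The logarithmic working of Example 4 can be confirmed directly. The sketch assumes, as the solution does, that the quoted standard deviations were computed with divisor n, so that each is first converted to the estimate $s = \sqrt{n/(n-1)}\,\sigma$; the function name is hypothetical.

```python
from math import log, sqrt


def z_from_sample_sds(sd1, n1, sd2, n2):
    """z = log_e(s1/s2) from sample standard deviations with divisor n."""
    s1 = sqrt(n1 / (n1 - 1)) * sd1  # estimate of the population s.d.
    s2 = sqrt(n2 / (n2 - 1)) * sd2
    # Absolute value, so that the larger estimate is effectively on top.
    return abs(log(s1 / s2))


z = z_from_sample_sds(1.34, 26, 0.98, 34)
z_05, z_01 = 0.306, 0.432  # critical values quoted in the question
print(f"z = {z:.4f}")
print("significant at 5 per cent:", z > z_05)
print("significant at 1 per cent:", z > z_01)
```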

5. The variability in weight in 2 lbs. packets of a guaranteed food
is expressed by a standard deviation of 0.05 oz. To test this a sample
of 25 packets was picked and weighed, giving the following results.

Weights of 25 packets in ounces :

32.11   32.03   32.05   31.98   32.16
31.97   32.25   32.14   32.07   32.03
32.18   32.07   32.19   31.99   32.18
32.05   32.07   31.96   32.09   32.04
32.10   32.15   32.03   32.08   31.98

We set up the null hypothesis that the standard deviation of
the population is 0.05.

In calculating the s. d. of the sample, the work may be simplified
by expressing the weights as deviations from 32 and multiplying
by 100. If we denote the new figures by x, we get

x : 11, 3, 5, -2, 16, -3, 25, 14, 7, 3, 18, 7, 19, -1, 18, 5, 7,
-4, 9, 4, 10, 15, 3, 8, -2.

We then have

$$\Sigma x = 195,\qquad \Sigma x^2 = 2951.$$

Hence

$$\Sigma(x-\bar{x})^2 = 2951 - \frac{(195)^2}{25} = 1430,$$
$$s^2 = \frac{1}{24}\Sigma(x-\bar{x})^2 = \frac{1430}{24} = 59.58.$$

Here the unit of s is 1/100 oz.

Therefore expressing $s^2$ in units of ounces, we get

$$s^2 = 0.005958.$$

Also $\sigma^2 = (0.05)^2 = 0.0025$.

We divide the larger variance by the smaller variance.

$$F = \frac{s^2}{\sigma^2} = \frac{0.005958}{0.0025} = 2.3832.$$

And
$$z = \tfrac{1}{2}\log_e\frac{s^2}{\sigma^2} = 1.1513\,\log_{10}\frac{s^2}{\sigma^2} = 1.1513\,\log_{10}2.3832 = 0.4342.$$

Since one of the population variances (i.e. $\sigma^2$) is known, the
corresponding number of degrees of freedom is infinite, i.e. $\nu_2 = \infty$.
Also $\nu_1 = 24$. From the tables, we find that for these degrees of
freedom, we have

$$z_{0.05} = 0.2085 \quad\text{and}\quad z_{0.01} = 0.2913.$$

Since the calculated value of z is greater than both $z_{0.05}$ and
$z_{0.01}$, the result is significant and the hypothesis that $\sigma = 0.05$ oz.
is rejected.
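The coded working of Example 5 may be verified as follows. The sketch starts from the coded deviations x listed in the solution (hundredths of an ounce from 32 oz.) and compares the sample estimate with the specified population variance; the variable names are illustrative.

```python
from math import log

# Coded weights: 100 * (weight in oz. - 32), as listed in the solution.
coded = [11, 3, 5, -2, 16, -3, 25, 14, 7, 3, 18, 7, 19, -1, 18, 5, 7,
         -4, 9, 4, 10, 15, 3, 8, -2]

n = len(coded)
sum_x = sum(coded)
sum_x2 = sum(v * v for v in coded)

# Sum of squared deviations about the mean, in coded units.
ss = sum_x2 - sum_x ** 2 / n

# Estimated variance, converted back to ounces squared (divide by 100^2).
s2 = ss / (n - 1) / 100 ** 2

sigma2 = 0.05 ** 2       # variance specified by the hypothesis
ratio = s2 / sigma2      # larger variance over the smaller
z = 0.5 * log(ratio)

print(f"sum x = {sum_x}, sum x^2 = {sum_x2}, sum of squares = {ss:.0f}")
print(f"s^2 = {s2:.6f}, F = {ratio:.4f}, z = {z:.4f}")
```

Since the population variance is specified rather than estimated, the tables are entered with 24 and an infinite number of degrees of freedom, as in the text.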


6. Two experimenters, A and B, take repeated measurements
on the length of a copper wire. On the basis of the data obtained
by them, which are given below, test whether B's measurements are
more accurate than A's. (It may be supposed that the readings
taken by both are unbiassed.)

A's measurements (in mm.)        B's measurements (in mm.)
   12.47     12.44                  12.06     12.34
   11.90     12.13                  12.23     12.46
   12.77     11.86                  12.46     12.39
   11.96     12.25                  11.98
   12.78     12.29                  12.22

It is given that the readings of both the experimenters are
unbiassed. Hence B's measurements will be considered more
accurate if their population variance is less than that of A's
measurements. Then the null hypothesis is that the two
population variances are the same, i.e. $\sigma_1^2 = \sigma_2^2$.

On this hypothesis, we have

$$F = \frac{s_1^2}{s_2^2}, \qquad \ldots(1)$$

with $\nu_1 = n_1 - 1 = 9$ and $\nu_2 = n_2 - 1 = 7$.

Let us denote the variates of the two series by x and y. We
set up two new variates defined by the relations $u = 100(x-12)$
and $v = 100(y-12)$.

Then we calculate the s. d.'s of the two series as follows :

    A's measurements           B's measurements
      u        u²                 v        v²
     47      2209                 6        36
    -10       100                23       529
     77      5929                46      2116
     -4        16                -2         4
     78      6084                22       484
     44      1936                34      1156
     13       169                46      2116
    -14       196                39      1521
     25       625
     29       841
    ----    -----               ----     ----
    285     18105               214      7962

Hence

$$s_1^2 = \frac{1}{9}\left\{18105 - \frac{(285)^2}{10}\right\} = \frac{1}{9}\{18105-8122\} = 1109$$

and

$$s_2^2 = \frac{1}{7}\left\{7962 - \frac{(214)^2}{8}\right\} = \frac{1}{7}\{7962-5724\} = 319.7.$$

Hence

$$F = \frac{1109}{319.7} = 3.47 \text{ approx.}$$

Now for $\nu_1 = 9$ and $\nu_2 = 7$, we get from the tables $F_{0.05} = 3.68$
and $F_{0.01} = 6.72$.

Since the calculated value of F is less than both $F_{0.05}$ and
$F_{0.01}$, the result is insignificant at both the 5% and 1% levels.

Hence there is no reason to suppose that B's measurements
are more accurate than A's.

7. It is known that the mean diameters of rivets produced by
two firms, I and II, are practically the same, but the standard
deviations may differ. For 22 rivets produced by Firm I, the
standard deviation is 2.9 mm., while for 16 rivets manufactured
by Firm II the standard deviation is 3.8 mm. Do you think
products of Firm I are of better quality than those of Firm II ?

Here $\sigma_x = 2.9$, $n_1 = 22$;
$\sigma_y = 3.8$, $n_2 = 16$.

$$s_1^2 = \frac{n_1}{n_1-1}\sigma_x^2 = \frac{22}{21}\times(2.9)^2 = 8.81$$

and

$$s_2^2 = \frac{n_2}{n_2-1}\sigma_y^2 = \frac{16}{15}\times(3.8)^2 = 15.40.$$

$$F = \frac{s_2^2}{s_1^2} = \frac{15.40}{8.81} = 1.75 \text{ approx.}$$

For 15 and 21 d. f., we get from the tables,

$$F_{0.05} = 2.18 \quad\text{and}\quad F_{0.01} = 3.04.$$

Since the observed value of F is less than both $F_{0.05}$ and $F_{0.01}$,
the result is insignificant at both the 5% and 1% levels. Hence it
cannot be said that the products of Firm I are of better quality
than those of Firm II.

8. A group of 10 patients and another group of 17 patients,
each in the same stage of a given disease, were put under different
treatments. The mean blood counts of these two groups were 12000
and 14000 respectively and their standard deviations were 1500 and
1000 respectively.

Fisher's t-test shows that the difference in blood counts is
highly significant. Can this be attributed, in part at least, to a
difference in variability ?

Here $\bar{x} = 12{,}000$, $n_1 = 10$, $\sigma_x = 1500$;
$\bar{y} = 14{,}000$, $n_2 = 17$, $\sigma_y = 1000$.

$$t = \frac{\bar{x}-\bar{y}}{\left\{\dfrac{n_1\sigma_x^2+n_2\sigma_y^2}{n_1+n_2-2}\right\}^{1/2}}\sqrt{\frac{n_1 n_2}{n_1+n_2}} = \frac{12000-14000}{\left\{\dfrac{10\times 2250000+17\times 1000000}{25}\right\}^{1/2}}\sqrt{\frac{170}{27}},$$

or

$$|t| = \frac{2000}{100\sqrt{158}}\sqrt{\frac{170}{27}} = 20\sqrt{\frac{170}{158\times 27}} = 20\times 0.1996 = 3.99.$$

Now from the tables we have for $\nu = 25$,
$t_{0.05} = 2.060$ and $t_{0.01} = 2.787$.

Since the calculated value of t is considerably greater than
both $t_{0.05}$ and $t_{0.01}$, the difference between the means is highly
significant.

We now apply the F-test to see whether the difference between
variances is significant.

Here
$$s_1^2 = \frac{n_1}{n_1-1}\sigma_x^2 = \frac{10}{9}\times 2250000$$
and
$$s_2^2 = \frac{n_2}{n_2-1}\sigma_y^2 = \frac{17}{16}\times 1000000.$$

$$F = \frac{s_1^2}{s_2^2} = \frac{10\times 2250000\times 16}{9\times 17\times 1000000} = \frac{40}{17} = 2.353.$$

Also for $\nu_1 = 9$ and $\nu_2 = 16$, we have
$F_{0.05} = 2.54$ and $F_{0.01} = 3.78$.

Since the calculated value of F is less than both $F_{0.05}$ and $F_{0.01}$,
the difference between the variances is not significant. Hence it
cannot be said that the difference between the means is due to
a difference in variability.

9. Two random samples drawn from two normal populations
are :

Sample I : 20, 16, 26, 27, 23, 22, 18, 24, 25 and 19.

Sample II : 27, 33, 42, 35, 32, 34, 38, 28, 41, 43, 30 and 37.

Obtain the estimates of the variances of the populations and
test whether the two populations have the same variance.

(I. A. S., 1956)

By actual calculations, we find that

$$\bar{x} = 22,\quad n_1 = 10,\quad \Sigma(x-\bar{x})^2 = 120;$$
$$\bar{y} = 35,\quad n_2 = 12,\quad \Sigma(y-\bar{y})^2 = 314.$$

This is left as an exercise for the students. If $s_1^2$ and $s_2^2$ are
the unbiassed estimates of the population variances, we have

$$s_1^2 = \frac{\Sigma(x-\bar{x})^2}{n_1-1} = \frac{120}{9},\qquad \nu_1 = 9,$$

and

$$s_2^2 = \frac{\Sigma(y-\bar{y})^2}{n_2-1} = \frac{314}{11},\qquad \nu_2 = 11.$$

Hence

$$F = \frac{s_2^2}{s_1^2} = \frac{314}{11}\times\frac{9}{120} = 2.14.$$

From the tables we have for $\nu_2 = 11$ and $\nu_1 = 9$, where $\nu_2$
corresponds to the greater variance,

$$F_{0.05} = 3.10.$$

Since the calculated value of F is much below $F_{0.05}$, the
difference between the variances is not significant, i.e. the
hypothesis that the two populations have the same variance is not
contradicted.
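The calculation left to the student in Example 9 can be sketched as below. The function returns the variance ratio with the larger unbiassed estimate in the numerator, together with the degrees of freedom in the order needed for the F-tables; the function names are illustrative.

```python
def unbiased_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)


def variance_ratio(a, b):
    """Return (F, nu_numerator, nu_denominator), larger variance on top."""
    var_a, var_b = unbiased_variance(a), unbiased_variance(b)
    if var_a >= var_b:
        return var_a / var_b, len(a) - 1, len(b) - 1
    return var_b / var_a, len(b) - 1, len(a) - 1


sample_1 = [20, 16, 26, 27, 23, 22, 18, 24, 25, 19]
sample_2 = [27, 33, 42, 35, 32, 34, 38, 28, 41, 43, 30, 37]

F, nu_num, nu_den = variance_ratio(sample_1, sample_2)
print(f"F = {F:.3f} with ({nu_num}, {nu_den}) degrees of freedom")
```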


Exercise

1. Two samples are composed of 7 and 9 individuals respectively,
and have variances 9.6 and 4.8 respectively. Is the
variance 9.6 significantly greater than the variance 4.8 ?

[Ans. The first variance cannot be regarded as significantly
greater than the second.
F = 2.075, $F_{0.05} = 3.58$, $F_{0.01} = 6.37$
or alternately, z = 0.3648, $z_{0.05} = 0.6378$, $z_{0.01} = 0.9259$]

2. The monthly consumption of scrap by the iron and steel
industry for two sets of 12 months are shown below. Is
there any significant difference in the variability of consumption
in the two years ?

Consumption of Scrap (1,000 tons)

April 1946 to March 1947 :
153, 158, 143, 138, 134, 144, 154, 160, 144, 144, 123, 148.

April 1947 to March 1948 :
145, 150, 158, 129, 139, 162, 166, 164, 145, 169, 174, 175.

[Ans. Not significant.
F = 1.95, $F_{0.05} = 2.82$, $F_{0.01} = 4.46$]

3. Seven shells were fired from a 75 mm. gun and their velocities
showed a variance of 150. The velocities of six shells fired
from the same gun but with a different brand of powder
showed a variance of 120. Test whether this difference in
their variability is unusual.

[Hint. $s_1^2 = \frac{7}{6}\times 150$; $s_2^2 = \frac{6}{5}\times 120$;
$F = \dfrac{7\times 150\times 5}{6\times 6\times 120} = 1.215$;
$F_{0.05} = 4.95$ for $\nu_1 = 6$ and $\nu_2 = 5$.
Hence the difference is not significant.]

4. Tests for breaking strength were carried out on two lots
of 7 and 9 steel wires respectively. The variance of one
lot was 230 and that of the other was 492. Is there a
significant difference in their variability ?

[Ans. Not significant, F = 2.06, $F_{0.05} = 4.15$]

5. The students of the same age of two different colleges were
tested for variability of intelligence. The I. Q.'s of 25
students from one college showed a variance of 16 and
those of an equal number from the other college had a
variance of 8. Discuss whether there is any significant
difference in variability.

[Ans. Just significant at 5 per cent level and not significant
at 1 per cent level.
F = 2, $F_{0.05} = 1.98$, $F_{0.01} = 2.62$]

6. In two groups of ten children each, the increases in weight
due to two different diets in the same period were, in pounds,

8, 5, 7, 8, 3, 2, 7, 6, 5, 7
3, 7, 5, 6, 5, 4, 4, 5, 3, 6.

Discuss whether there is a significant difference between
their variability.

[Ans. Not significant. F = 2.4, $F_{0.05} = 3.18$]





17.16. To test the significance of correlation coefficient (Small
samples).

Suppose the variates x and y are distributed in the bivariate
normal form with means $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1$ and
$\sigma_2$ and have correlation coefficient $\rho$. A sample $(x_1, y_1), (x_2, y_2),
\ldots, (x_n, y_n)$ is drawn from this population, the n pairs of values
of x and y being random and independent observations. We
have to test the hypothesis that $\rho = 0$.

We define t by the relation

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \qquad \ldots(1)$$

where r is the correlation coefficient in the sample, i.e.

$$r = \frac{\dfrac{1}{n}\Sigma(x-\bar{x})(y-\bar{y})}{\left\{\dfrac{1}{n}\Sigma(x-\bar{x})^2\cdot\dfrac{1}{n}\Sigma(y-\bar{y})^2\right\}^{1/2}}.$$

Assuming $\rho = 0$, it can be shown that t defined by (1) is distributed
as a Fisher's t with n - 2 degrees of freedom. Hence
the tests of significance based on Fisher's t-distribution will be
applied to test the significance of an observed correlation
coefficient.

Example 1. A random sample of 15 from a normal population
gives a correlation coefficient of -0.5. Is this significant of
the existence of correlation in the population ?

(Delhi B. Com. 1954)

Here r = -0.5 and n = 15.

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.5\sqrt{15-2}}{\sqrt{1-0.25}},$$

or

$$|t| = \frac{0.5\sqrt{13}}{\sqrt{0.75}} = \frac{5\sqrt{13}}{\sqrt{75}} = 2.08 \text{ approx.},$$

and $\nu = 15 - 2 = 13$.

From the tables we get for 13 d. f.,

$$t_{0.05} = 2.16.$$

Since the observed value of |t| is less than $t_{0.05}$, the sample
correlation coefficient is not significant enough to warrant the existence
of a correlation in the population.

Example 2. Show that in samples of 25 from an uncorrelated
normal population the chance is 1 in 100 that r is greater than
about 0.43.

Here n = 25, r = 0.43.

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.43\times\sqrt{23}}{\sqrt{1-(0.43)^2}} = \frac{0.43\times 4.7958}{\sqrt{1-0.1849}} = 2.3 \text{ nearly.}$$

Also we find from the tables that the probability that
t > 2.492 is 0.01, i.e. 1 in 100. Since the calculated value of t is
nearly equal to this value, the result follows.
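Relation (1) of this article and its inverse are easily put into code. The sketch below reproduces the values of t in Examples 1 and 2 and also solves (1) for r, which gives the smallest r that reaches a chosen critical value of t; both function names are hypothetical.

```python
from math import sqrt


def t_from_r(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


def r_from_t(t, n):
    """Inverse of the relation above: the r that yields a given t."""
    return t / sqrt(t * t + n - 2)


t_example_1 = t_from_r(-0.5, 15)   # Example 1
t_example_2 = t_from_r(0.43, 25)   # Example 2

print(f"Example 1: t = {t_example_1:.3f} on 13 d.f.")
print(f"Example 2: t = {t_example_2:.3f} on 23 d.f.")
print(f"r recovered from Example 2: {r_from_t(t_example_2, 25):.2f}")
```

Passing a tabulated critical value of t to `r_from_t` gives the least value of |r| that is significant at the corresponding level for a sample of n pairs.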


Example 3. Find the least value of r in a sample of 18 pairs
from a bivariate normal population, significant at 5% level.

We have from the tables for 16 degrees of freedom,

$$t_{0.05} = 2.12.$$

Also

$$t = \frac{r\sqrt{18-2}}{\sqrt{1-r^2}} = \frac{4r}{\sqrt{1-r^2}}.$$

Hence for the 5% level, we must have

$$|t| > 2.12 \quad\text{or}\quad \frac{4\,|r|}{\sqrt{1-r^2}} > 2.12,$$

i.e. $16r^2 > 4.4944\,(1-r^2)$, or $r^2 > \dfrac{4.4944}{20.4944} = 0.2193$.

This gives $|r| > 0.47$.

Thus the least value of $|r|$ is 0.47.


17.17. Test of significance of correlation coefficient based
on Fisher's z-transformation (Large samples).

Let r and $\rho$ be the correlations in the sample and the population.
Then it has been shown that the distribution of r is not normal
and its probability curve is very skew in the neighbourhood of
$\rho = \pm 1$ even for large values of n.

Fisher used the following transformation :

$$r = \tanh z,\qquad \rho = \tanh\zeta,$$

so that

$$z = \tfrac{1}{2}\log_e\frac{1+r}{1-r} = 1.1513\,\log_{10}\frac{1+r}{1-r} \qquad \ldots(1)$$

and

$$\zeta = \tfrac{1}{2}\log_e\frac{1+\rho}{1-\rho} = 1.1513\,\log_{10}\frac{1+\rho}{1-\rho}. \qquad \ldots(2)$$

He showed that the distribution of z given by (1) is approximately
normal with mean $\zeta$ and standard deviation $\dfrac{1}{\sqrt{n-3}}$.

This approximation is quite good for large values of n, say > 50.
But it can be used for many practical purposes even for small
samples.

It follows that $\dfrac{z-\zeta}{1/\sqrt{n-3}}$ is a standard normal variate with
zero mean and unit standard deviation.

Hence if $|(z-\zeta)\sqrt{n-3}| > 1.96$, the difference between
$\rho$ and r is significant at the 5 per cent level. If it is greater
than 2.58, the difference is significant at 1 per cent level.

Note. We have seen in the theory of variables for large
samples that the standard error of r is $\dfrac{1-r^2}{\sqrt{n}}$. It is to be used
with utmost reserve for values of r near unity, since the distribution
in such a case is markedly skew unless n is very large,
say, at least 500. When there is any doubt, the alternative test
discussed above based on Fisher's transformation must be used.

It should be further noted that the symbol z used here is
different from the z of Fisher's z-distribution.

Example 1. What is the probability that a correlation coefficient of 0.75 or less can arise in a sample of 30 from a normal population in which the true correlation is +0.9?

Here r = 0.75, ρ = 0.9 and n = 30.

$$z = 1.1513\,\log_{10}\frac{1+r}{1-r} = 1.1513\,\log_{10}\frac{1.75}{0.25}$$
$$= 1.1513\,[\log_{10}1.75 - \log_{10}0.25]$$
$$= 1.1513\,[0.24304 - \bar{1}.39794] \quad \text{from log tables}$$
$$= 0.973 \text{ approx.}$$

$$\zeta = 1.1513\,\log_{10}\frac{1+\rho}{1-\rho} = 1.1513\,\log_{10}\frac{1.9}{0.1}$$
$$= 1.1513\,[\log_{10}1.9 - \log_{10}0.1]$$
$$= 1.1513\,[0.27875 + 1]$$
$$= 1.1513 \times 1.27875 = 1.472 \text{ approx.}$$

Hence

$$\frac{\zeta - z}{1/\sqrt{n-3}} = (1.472 - 0.973)\sqrt{27} = 0.499 \times 5.196 = 2.59 \text{ approx.}$$

From the tables of areas of the normal curve, we find that the area to the left of the ordinate at x = 2.59 is 0.9952.

Hence the area to the right of x = 2.59 is 1 − 0.9952 = 0.0048, which is the required probability that r ≤ 0.75, i.e. that $(\zeta - z)\sqrt{n-3} \ge 2.59$.
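The arithmetic of Example 1 can be checked without log tables. The following is only a sketch: the function names are illustrative and not part of the text, and the normal area is obtained from the complementary error function instead of printed tables.

```python
import math

def fisher_z(r):
    """Fisher's transformation z = (1/2) log_e((1 + r) / (1 - r)), i.e. atanh(r)."""
    return math.atanh(r)

def z_statistic(r, rho, n):
    """Standardised deviate (z - zeta) * sqrt(n - 3), approximately N(0, 1)."""
    return (fisher_z(r) - fisher_z(rho)) * math.sqrt(n - 3)

def normal_cdf(x):
    """Area under the standard normal curve to the left of x."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

# Example 1: r = 0.75, rho = 0.9, n = 30.
stat = z_statistic(0.75, 0.9, 30)   # negative, because r lies below rho
prob = normal_cdf(stat)             # probability that r <= 0.75
print(round(stat, 2), round(prob, 4))
```

The deviate comes out close to −2.59 and the probability close to 0.005, in agreement with the value read from the tables.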

Example 2. The correlation between the price indices of animal feeding-stuffs and home-grown oats in a sample of 60 members is 0.68.

Could the observed value have arisen

(a) from an uncorrelated population,

(b) from a population in which the true correlation was 0.8?

(a) Here ρ = 0, so that ζ = 0.

Also $z = \tfrac{1}{2}\log_e\frac{1+0.68}{1-0.68} = 1.1513\,\log_{10}\frac{1.68}{0.32} = 0.829$, using log tables.

The standard error of z $= \frac{1}{\sqrt{n-3}} = \frac{1}{\sqrt{57}} = 0.13$.

$$\therefore \frac{z-\zeta}{1/\sqrt{n-3}} = \frac{0.829 - 0}{0.13} = 6.38 \text{ approx.}$$

Since the deviation of z from ζ is more than six times the standard error, the hypothesis is not correct, i.e. the population is correlated.

(b) Here $\zeta = \tfrac{1}{2}\log_e\frac{1+\rho}{1-\rho} = 1.1513\,\log_{10}\frac{1.8}{0.2} = 1.099$, using log tables.

Hence in this case

$$\frac{|z-\zeta|}{1/\sqrt{n-3}} = \frac{|0.829 - 1.099|}{0.13} = \frac{0.270}{0.13} = 2.08.$$

It follows that the difference between z and ζ is about two times the standard error. This is a borderline case: the value 2.08 just exceeds the 5 per cent value 1.96 but falls short of the 1 per cent value 2.58. In other words, ρ is likely to be less than +0.8.

Example 3. A correlation coefficient of 0.7 is discovered in a sample of 28 pairs. Apply the z-transformation to find out if this differs significantly (a) from 0, (b) from 0.5.

(a) Here ρ = 0, so that ζ = 0, and

$$z = 1.1513\,\log_{10}\frac{1+0.7}{1-0.7} = 0.87 \text{ from log tables.}$$

$$\therefore \frac{z-\zeta}{1/\sqrt{n-3}} = (0.87 - 0)\sqrt{25} = 4.35 > 3.$$

Hence the hypothesis of zero correlation is refuted. It follows that the population is correlated.

(b) Here ρ = 0.5, $\zeta = 1.1513\,\log_{10}\frac{1+0.5}{1-0.5} = 0.55$.

Hence

$$\frac{z-\zeta}{1/\sqrt{n-3}} = (0.87 - 0.55)\sqrt{28-3} = 1.6 < 1.96.$$

It follows that the difference between z and ζ is not significant and as such the hypothesis that ρ = 0.5 is not denied.

17.18. Significance of the difference between two independent correlation coefficients.

If two samples of sizes n₁ and n₂ give correlation coefficients r₁ and r₂, we have to test the hypothesis that the samples are drawn from the same population or from two populations with the same correlation coefficient. If the hypothesis is true, the statistic

$$\frac{z_1 - z_2}{\sqrt{\dfrac{1}{n_1-3} + \dfrac{1}{n_2-3}}},$$

where $z_1 = \tfrac{1}{2}\log_e\frac{1+r_1}{1-r_1}$ and $z_2 = \tfrac{1}{2}\log_e\frac{1+r_2}{1-r_2}$, is approximately distributed normally with zero mean and unit standard deviation.

If the numerical value of this statistic is greater than 1.96, the difference is significant at the 5% level.

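Example 1. The first of two samples consists of 23 pairs and gives a correlation of 0.5 while the second of 28 pairs has a correlation of 0.8. Are these values significantly different?

Here $z_1 = 1.1513\,\log_{10}\frac{1+0.5}{1-0.5} = 0.55$ and $z_2 = 1.1513\,\log_{10}\frac{1+0.8}{1-0.8} = 1.10$, with n₁ = 23, n₂ = 28.

$$\therefore \frac{z_1 - z_2}{\sqrt{\dfrac{1}{n_1-3} + \dfrac{1}{n_2-3}}} = \frac{0.55 - 1.10}{\sqrt{\frac{1}{20} + \frac{1}{25}}} = \frac{-0.55}{0.3} = -1.83.$$

Since the numerical value 1.83 is less than 1.96, the difference is not quite significant at the 5% level and consequently the hypothesis is not denied.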
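The same test can be sketched in a few lines of code. The function name below is an illustrative assumption; the figures are those of Example 1.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """Return (z1 - z2) / sqrt(1/(n1 - 3) + 1/(n2 - 3)) for two independent samples."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error

# Example 1: r1 = 0.5 from 23 pairs, r2 = 0.8 from 28 pairs.
stat = compare_correlations(0.5, 23, 0.8, 28)
significant_at_5_percent = abs(stat) > 1.96
print(round(stat, 2), significant_at_5_percent)   # -1.83 False
```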


Example 2. The correlation coefficients between temperatures of rice and breakage percentage, calculated from two samples of 12 and 16, are 0.8912 and 0.8482 respectively. Do the two estimates differ significantly?

Here r₁ = 0.8912, n₁ = 12; r₂ = 0.8482, n₂ = 16,

$$z_1 = 1.1513\,\log_{10}\frac{1.8912}{0.1088} = 1.4276 \quad\text{and}\quad z_2 = 1.1513\,\log_{10}\frac{1.8482}{0.1518} = 1.2496.$$

Hence

$$\frac{z_1 - z_2}{\sqrt{\dfrac{1}{n_1-3} + \dfrac{1}{n_2-3}}} = \frac{1.4276 - 1.2496}{\sqrt{0.1880}} = 0.41 \text{ approx.},$$

which is considerably less than 1.96. Hence the difference is not significant at the 5% level.



CHAPTER XVIII 
ANALYSIS OF VARIANCE 

18.1. Meaning and definition. Experimental design consists of three processes: planning the experiments, analysing the results, and interpreting the results. In this chapter we shall be concerned with the interpretation of results, which is a matter of statistical inference. The technique for making inferences is known as the analysis of variance. This powerful technique was developed by R. A. Fisher for separating the experimentally observed variance into a number of components traceable to specific sources. He defines it as "the separation of the variance ascribable to one group of causes from the variance ascribable to other groups." Essentially, analysis of variance provides a test of the homogeneity of a set of data. In an experiment there are generally several factors at work, each of which may cause a certain amount of variability in the observations made. The aim of analysis of variance is to find how much of the total variability is due to each factor; by comparing these contributory amounts of variation, we can test the homogeneity of the observations. By homogeneity of the data, we mean that all the observations are drawn from the same normal population.

18.2. Variance within and between classes. Suppose a
random sample has been taken from a normally distributed 
population. We subdivide this sample into a number of sub¬ 
samples (or classes) on the basis of a set of conditions. The first 
step in the analysis of variance is to separate the total variation 
in the whole number of observations into two parts : (1) the 

variance between the classes or the variance attributable to the 
different conditions, (2) the variance which arises from individual 
differences within the classes. The variation between classes is 
due to assignable causes whereas the variation within classes is 
due to various chance causes. The aim of analysis of variance is 
to determine whether there is a significant difference between 
class means in view of the variability within the separate classes 
(individual differences). 

For example, a herd of cows may be divided into several 





classes according to their breeds and the amount of milk-yield of 
each cow over a given period may be recorded. We may then 
examine the variation between the mean milk-yields of different 
breeds and the variation within breeds. Here the criterion of 
classification is only one i. e. the breed of the cows. There may 
be more than one criterion of classification. We may, for 
example, have to analyse, not only the influence of breed on milk- 
yield, but also that of different varieties of foodstuffs. 

18.3. One criterion of classification. Suppose a sample of N values of a given variate x is sub-divided into k classes, according to some criterion of classification. Let the ith class consist of $n_i$ members and let the jth member of the ith class be denoted by $x_{ij}$. Then $\sum_{i=1}^{k} n_i = N$. We write the sample values as follows:

    Classes    Values
       1       $x_{11}$  $x_{12}$  ...  $x_{1j}$  ...  $x_{1n_1}$
       2       $x_{21}$  $x_{22}$  ...  $x_{2j}$  ...  $x_{2n_2}$
      ...      ...
       i       $x_{i1}$  $x_{i2}$  ...  $x_{ij}$  ...  $x_{in_i}$
      ...      ...
       k       $x_{k1}$  $x_{k2}$  ...  $x_{kj}$  ...  $x_{kn_k}$

Let the general mean of all the N values be $\bar{x}$ and the mean of the ith class be $\bar{x}_i$.

Now

$$\sum_{j=1}^{n_i}(x_{ij}-\bar{x})^2 = \sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i+\bar{x}_i-\bar{x})^2$$
$$= \sum_{j}(x_{ij}-\bar{x}_i)^2 + 2\sum_{j}(x_{ij}-\bar{x}_i)(\bar{x}_i-\bar{x}) + \sum_{j}(\bar{x}_i-\bar{x})^2$$
$$= \sum_{j}(x_{ij}-\bar{x}_i)^2 + n_i(\bar{x}_i-\bar{x})^2$$

$$\Big[\because\ 2\sum_{j}(x_{ij}-\bar{x}_i)(\bar{x}_i-\bar{x}) = 2(\bar{x}_i-\bar{x})\sum_{j}(x_{ij}-\bar{x}_i) = 0 \text{ for all } i.\Big]$$

Hence

$$\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2 + \sum_{i=1}^{k} n_i(\bar{x}_i-\bar{x})^2. \qquad \ldots(1)$$

The equation (1) shows that the total variation given on the left hand side is divided into two parts: one, given by the second term on the right, is the variation between classes, and the other, given by the first term on the right, is the residual variation within classes, after the variation between classes has been separated from the total variation. Thus

Total variation = variation between classes + variation within classes.
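The decomposition can be verified numerically on any classified sample. The sketch below uses a small hypothetical data set (not taken from the text) with unequal class sizes; the function names are illustrative only.

```python
def mean(values):
    return sum(values) / len(values)

def sums_of_squares(classes):
    """Return (total, between, within) sums of squares for a list of classes."""
    all_values = [x for cls in classes for x in cls]
    grand_mean = mean(all_values)
    total = sum((x - grand_mean) ** 2 for x in all_values)
    between = sum(len(cls) * (mean(cls) - grand_mean) ** 2 for cls in classes)
    within = sum((x - mean(cls)) ** 2 for cls in classes for x in cls)
    return total, between, within

# Hypothetical sample of N = 9 values in k = 3 classes of unequal size.
classes = [[4, 6, 8], [1, 3], [7, 9, 10, 10]]
total, between, within = sums_of_squares(classes)
print(total, between + within)   # the two numbers agree up to rounding error
```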

We now set up the null hypothesis that the classifying factor 
has no effect on the value of the variate so that each class into 
which the sample is divided by this factor will be a random sample 
from the parent population. 

Taking expected values of both sides of (1), we have

$$E\Big[\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x})^2\Big] = (N-1)\,\sigma^2,$$

where σ² is the variance of the population, and

$$E\Big[\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2\Big] = \sum_{i=1}^{k} E\Big[\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2\Big] = \sum_{i=1}^{k}(n_i-1)\,\sigma^2 = (N-k)\,\sigma^2.$$

It follows that

$$E\Big[\sum_{i=1}^{k} n_i(\bar{x}_i-\bar{x})^2\Big] = (N-1)\,\sigma^2 - (N-k)\,\sigma^2 = (k-1)\,\sigma^2.$$

We see from above that

$$\frac{1}{k-1}\sum_{i=1}^{k} n_i(\bar{x}_i-\bar{x})^2, \qquad \frac{1}{N-k}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2 \qquad\text{and}\qquad \frac{1}{N-1}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x})^2$$

are all unbiased estimates of σ².

At this stage we make a further assumption that the sampled population is normal. With this restriction the first two of the above estimates of σ² are independent. The ratio of these estimates is distributed as F with k − 1 and N − k degrees of freedom.





Hence we may employ the F -test to see if the between- 
sample estimate is significantly larger than the residual estimate. 

Note. (i) In some cases there is high variability within the samples and smaller variability between the means, i.e. even when the chance effects are very large, the sample averages remain fairly stable. In such cases F is generally less than one.

(ii) Although the quantity $\frac{1}{N-1}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x})^2$ provides an estimate of the over-all variance, it is of no use in an F-test. So it is useless to compute this estimate.


We now draw up the following analysis of variance table:

Analysis of variance for one criterion of classification

    (1)                (2)                                                  (3)           (4)             (5)
    Source of          Sum of squares                                       Degrees of    Estimated       Variance ratio F
    variation                                                               freedom       variance

    Between classes    $\sum_i n_i(\bar{x}_i-\bar{x})^2 = V_1$              k − 1         $V_1/(k-1)$     $F = \dfrac{N-k}{k-1}\cdot\dfrac{V_1}{V_2}$
    Within classes     $\sum_i\sum_j (x_{ij}-\bar{x}_i)^2 = V_2$            N − k         $V_2/(N-k)$
    Total              $\sum_i\sum_j (x_{ij}-\bar{x})^2 = V$                N − 1

The columns (2) and (3) are additive but the column (4) is not so. Thus we get a simple check to be made in numerical examples.
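The rows of this table can be produced mechanically. The following sketch assumes the data are supplied as a list of classes; the function name, the dictionary layout and the data are hypothetical and chosen only so that the arithmetic is easy to follow by hand.

```python
def one_way_anova(classes):
    """Return the entries of the one-criterion analysis of variance table."""
    all_values = [x for cls in classes for x in cls]
    N, k = len(all_values), len(classes)
    grand_mean = sum(all_values) / N
    class_means = [sum(cls) / len(cls) for cls in classes]
    # V1: variation between classes; V2: variation within classes.
    v1 = sum(len(cls) * (m - grand_mean) ** 2 for cls, m in zip(classes, class_means))
    v2 = sum((x - m) ** 2 for cls, m in zip(classes, class_means) for x in cls)
    between_var = v1 / (k - 1)
    within_var = v2 / (N - k)
    return {
        "between": (v1, k - 1, between_var),
        "within": (v2, N - k, within_var),
        "total": (v1 + v2, N - 1),
        "F": between_var / within_var,
    }

# Hypothetical data: class means 2, 3, 7 and general mean 4.
table = one_way_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(table)
```

Here V₁ = 42 with 2 degrees of freedom and V₂ = 6 with 6 degrees of freedom, so that F = 21/1 = 21; the sums of squares and degrees of freedom add up to the total row, as the check in the text requires.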

18.4. Solved Examples.

1. To test the significance of the variation of the retail price of a certain commodity in the four principal cities, Bombay, Calcutta, Delhi and Madras, seven shops were chosen at random in each city and the prices observed were as follows:

    Bombay      82  79  73  69  69  63  61  (nP.)
    Calcutta    84  82  80  79  76  68  62
    Delhi       88  84  80  68  68  66  66
    Madras      79  77  76  74  72  68  64

Do the data indicate that the prices in the four cities are significantly different? Tabulate the results properly for the study of variation both between cities and within cities.
(Agra B. Sc. '58)

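We set up the hypothesis that the price of the commodity in the four cities is the same.

Since a shift of origin does not affect the variance of a set of values, we may take any convenient origin which will reduce the calculations. In this case, we take the origin at x = 76.

The work of calculation may be further reduced by using the following identities.

Let $S = \sum_i\sum_j x_{ij}$ and $S_i = \sum_j x_{ij}$.

We then have

(i) $\sum_i\sum_j (x_{ij}-\bar{x})^2 = \sum_i\sum_j x_{ij}^2 - N\bar{x}^2 = \sum_i\sum_j x_{ij}^2 - \dfrac{S^2}{N}.$

(ii) $\sum_i\sum_j (x_{ij}-\bar{x}_i)^2 = \sum_i\Big\{\sum_j x_{ij}^2 - n_i\bar{x}_i^2\Big\} = \sum_i\sum_j x_{ij}^2 - \sum_i \dfrac{S_i^2}{n_i}.$

(iii) $\sum_i n_i(\bar{x}_i-\bar{x})^2 = \sum_i\sum_j (x_{ij}-\bar{x})^2 - \sum_i\sum_j (x_{ij}-\bar{x}_i)^2 = \sum_i \dfrac{S_i^2}{n_i} - \dfrac{S^2}{N}.$

In the present example, $n_i = 7$ for all i and k = 4. Hence N = nk = 28.

With the prices measured from the origin 76, the calculations are as follows:

    City        Deviations from 76                 S_i     S_i²/n     Σ_j x_ij²
    Bombay      6   3  −3  −7  −7  −13  −15        −36     185.14     546
    Calcutta    8   6   4   3   0   −8  −14         −1       0.14     385
    Delhi       12  8   4  −8  −8  −10  −10        −12      20.57     552
    Madras      3   1   0  −2  −4   −8  −12        −22      69.14     238
    Total                                       S = −71    275.00     1721

    S²/N = 71 × 71/28 = 180.04 approx.

The analysis of variance table is then as follows:

    Source of        Sum of squares of             Degrees of       Estimate of
    variation        deviations                    freedom          variance

    Between cities   275.00 − 180.04 = 94.96       k − 1 = 3        94.96/3 = 31.65
    Within cities    1721 − 275.00 = 1446.00       N − k = 24       1446.00/24 = 60.25
    Total            1540.96                       N − 1 = 27

Since the estimate of variance between cities is smaller than that within cities, F is less than 1. Hence there is no significant difference between the prices of the commodity in the four cities.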
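As a check on the short-cut identities, the sketch below recomputes the sums of squares for the price data from S, the S_i and the raw sum of squares, first with 76 as a working origin and then with no shift at all. The function name and dictionary keys are illustrative assumptions.

```python
prices = {
    "Bombay":   [82, 79, 73, 69, 69, 63, 61],
    "Calcutta": [84, 82, 80, 79, 76, 68, 62],
    "Delhi":    [88, 84, 80, 68, 68, 66, 66],
    "Madras":   [79, 77, 76, 74, 72, 68, 64],
}

def shortcut_sums(classes, origin):
    """Sums of squares from S, S_i and the sum of squared deviations from `origin`."""
    shifted = [[x - origin for x in cls] for cls in classes]
    N = sum(len(cls) for cls in shifted)
    S = sum(sum(cls) for cls in shifted)
    raw = sum(x * x for cls in shifted for x in cls)                 # sum of x_ij^2
    class_term = sum(sum(cls) ** 2 / len(cls) for cls in shifted)    # sum of S_i^2 / n_i
    correction = S * S / N                                           # S^2 / N
    return {
        "total": raw - correction,
        "between": class_term - correction,
        "within": raw - class_term,
    }

result = shortcut_sums(list(prices.values()), origin=76)
other = shortcut_sums(list(prices.values()), origin=0)   # same sums of squares
k, N = 4, 28
F = (result["between"] / (k - 1)) / (result["within"] / (N - k))
print(result, round(F, 2))
```

Both choices of origin give the same three sums of squares, and the variance ratio is well below 1.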




















2. A number of leaves were taken from each of half a dozen trees and their lengths measured. The following are the measurements in millimetres:

    Tree    Length
     1      82  87  86  90  81  84
     2      85  84  91  92  88
     3      92  90  84  86  88  93  89  90
     4      80  86  87  81  82  82
     5      87  86  88  90  85  86  87
     6      90  86  84  85  85  86  87  84  87

Can all these leaves be regarded as having come from the same species of trees?

We set up the hypothesis that the leaves come from the same species of trees.

We take the origin at x = 85 and set out the table as in the previous example.

    Trees      Number of      Deviations from 85                  S_i       S_i²/n_i    Σ_j x_ij²
    (k = 6)    leaves n_i
       1          6           −3   2   1   5  −4  −1                0          0          56
       2          5            0  −1   6   7   3                   15         45          95
       3          8            7   5  −1   1   3   8   4   5       32        128         190
       4          6           −5   1   2  −4  −3  −3              −12         24          64
       5          7            2   1   3   5   0   1   2           14         28          44
       6          9            5   1  −1   0   0   1   2  −1   2    9          9          37
     Total     N = 41                                           S = 58       234         486

    S²/N = 58 × 58/41 = 82.05

Here N = 6 + 5 + 8 + 6 + 7 + 9 = 41.

The table for analysis of variance is shown below:

    Source of        Sum of squares            Degrees of      Estimate of           F
    variation                                  freedom         variance

    Between trees    234 − 82.05 = 151.95      k − 1 = 5       151.95/5 = 30.39      30.39/7.20 = 4.22
    Within trees     486 − 234 = 252           N − k = 35      252/35 = 7.20
    Total            403.95                    N − 1 = 40

The 5 per cent value of F for ν₁ = 5 and ν₂ = 30 is 2.53 and for ν₁ = 5 and ν₂ = 40 this value is 2.45. Hence the 5 per cent value of F for ν₁ = 5 and ν₂ = 35 will lie between 2.45 and 2.53. Since the calculated value of F is much greater than this value, the hypothesis is rejected. In other words, all the leaves cannot be regarded as having come from the same species of trees.

3. The following table gives the results of experiments on four varieties of a crop in 5 blocks of plots:

                  Block
                  1     2     3     4     5
    Variety A     32    34    33    35    37
            B     34    33    36    37    35
            C     31    34    35    32    36
            D     29    26    30    28    29

Prepare the table of analysis of variance to test the significance of difference between the yields of the four varieties.
(Punjab M. A. '46)

We set up the null hypothesis that there is no significant difference between the yields of the four varieties.

Taking the origin at x = 32, we draw the following table:






    Varieties    Blocks (n = 5)                     S_i       S_i²/n    Σ_j x_ij²
    (k = 4)      1     2     3     4     5

       A         32    34    33    35    37         11        24.2      39
                  0     2     1     3     5
                  0     4     1     9    25

       B         34    33    36    37    35         15        45        55
                  2     1     4     5     3
                  4     1    16    25     9

       C         31    34    35    32    36          8        12.8      30
                 −1     2     3     0     4
                  1     4     9     0    16

       D         29    26    30    28    29        −18        64.8      74
                 −3    −6    −2    −4    −3
                  9    36     4    16     9

     Total                                        S = 16     146.8     198

    S²/N = 16 × 16/20 = 12.8

Analysis of variance table.

    Source of            Sum of squares           Degrees of               Estimate of          F
    variation                                     freedom                  variance

    Between varieties    146.8 − 12.8 = 134       k − 1 = 3                134/3 = 44.67        44.67/3.2 = 13.96
    Within varieties     198 − 146.8 = 51.2       N − k = 20 − 4 = 16      51.2/16 = 3.2
    Total                185.2                    19                       -                    -

The 5 per cent value of F for ν₁ = 3 and ν₂ = 16 is 3.24. Since the calculated value of F is much greater than this value, the hypothesis is rejected, i.e. there is a significant difference between the yields of the four varieties.
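The variance ratio of this example, and the comparison with the tabulated 5 per cent value, may be sketched as follows. The function name is an illustrative assumption; the critical value 3.24 is simply the one quoted in the text, not computed.

```python
yields = {
    "A": [32, 34, 33, 35, 37],
    "B": [34, 33, 36, 37, 35],
    "C": [31, 34, 35, 32, 36],
    "D": [29, 26, 30, 28, 29],
}
F_5_PERCENT = 3.24   # tabulated value for 3 and 16 degrees of freedom, quoted in the text

def variance_ratio(classes):
    """F = (between-class estimate of variance) / (within-class estimate of variance)."""
    values = [x for cls in classes for x in cls]
    N, k = len(values), len(classes)
    correction = sum(values) ** 2 / N                              # S^2 / N
    class_term = sum(sum(cls) ** 2 / len(cls) for cls in classes)  # sum of S_i^2 / n_i
    raw = sum(x * x for x in values)                               # sum of x_ij^2
    between = (class_term - correction) / (k - 1)
    within = (raw - class_term) / (N - k)
    return between / within

F = variance_ratio(list(yields.values()))
reject = F > F_5_PERCENT
print(round(F, 2), reject)   # 13.96 True
```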

















4. What are the purposes of analysis of variance and what are the assumptions involved in the interpretation of such analysis?

Three varieties A, B, C of a crop are tested in a randomized block design with four replications, the layout being given in the diagram appended. The plot yields in pounds are also indicated therein. Analyse the experimental yield, and state your conclusions (5 per cent value of F for ν₁ = 2 and ν₂ = 9 is 4.26).

     A      C      A      B
     6      5      8      9

     C      A      B      C
     8      4      6      9

     B      B      C      A
     7      6     10      6

(Delhi M. A. Final '59, I. A. S. '50)

For the first part, see the theory.

For the second part, we set up the hypothesis that there is no difference between the yields of the three varieties of crop.

The calculation of the variance between and within the varieties is shown in the following table. We take the origin at x = 7:

    Varieties    Number of block (n = 4)        S_i      S_i²/n    Σ_j x_ij²
    (k = 3)      1     2     3     4

       A          6     8     4     6           −4       4         12
                 −1     1    −3    −1
                  1     1     9     1

       B          9     6     7     6            0       0          6
                  2    −1     0    −1
                  4     1     0     1

       C          5     8     9    10            4       4         18
                 −2     1     2     3
                  4     1     4     9

     Total                                     S = 0     8         36

    S²/N = 0

The analysis of variance table is as follows:

    Source of            Sum of squares     Degrees of             Estimate of       F
    variation                               freedom                variance

    Between varieties    8 − 0 = 8          k − 1 = 2              8/2 = 4           4/3.11 = 1.286
    Within varieties     36 − 8 = 28        N − k = 12 − 3 = 9     28/9 = 3.11
    Total                36                 N − 1 = 11             -

Since the calculated value of F is much less than the 5 per cent value of F for the above degrees of freedom, the hypothesis is not rejected. In other words, the difference in varieties of the given crop does not have any significant effect on the yield.


Exercises

1. In a hypothetical experiment 48 subjects are assigned at random to 8 groups of 6 subjects each. These groups are tested under 8 different experimental conditions, designated respectively A, B, C, D, E, F, G and H.

    Conditions
     A     B     C     D     E     F     G     H
    64    73    77    78    63    75    78    55
    72    61    83    91    65    93    46    66
    68    90    97    97    44    78    41    49
    77    80    69    82    77    71    50    64
    56    97    79    85    65    63    69    70
    95    67    87    77    76    76    82    68

Prepare the table of analysis of variance to test the significance of difference among the 8 condition-means.

[Ans. S. S. between condition means = 3527.
Mean square between condition means = 503.9.
S. S. within conditions = 5666.
Mean square within conditions = 141.6.
F = 503.9/141.6 = 3.56.
The 5 per cent value of F for ν₁ = 7 and ν₂ = 40 is 2.24, which is less than the calculated value. Hence the difference is significant.]

2. Set up a table of analysis of variance for yields of three strains of wheat planted in five randomized blocks.

    Strains    Blocks
       A       20    21    23    16    20
       B       18    20    17    15    25
       C       25    28    22    28    32

[Ans. Mean square between strains = 95.
Mean square within strains = 11.67.
F = 8.14.
The 5 per cent value of F for ν₁ = 2 and ν₂ = 12 is 3.88, which is much less than the calculated value. Hence the difference is significant.]

3. The following shows the lives in hours of four batches of electric lamps:

    Batches
       1     1600, 1610, 1650, 1680, 1700, 1720, 1800
       2     1580, 1640, 1640, 1700, 1750
       3     1460, 1550, 1600, 1620, 1640, 1660, 1740, 1820
       4     1510, 1520, 1530, 1570, 1600, 1680

Perform an analysis of variance on these data and show that a significance test does not reject their homogeneity.

[Ans. Mean square between batches = 14,787.
Mean square within batches = 6,880.
F = 2.15, which is not significant.]

4. The following table shows the yields in pounds of lima beans on 20 plots of ground subjected to four different treatments, five plots per treatment. Set up an analysis of variance table to test the significance of the difference between the yields due to different treatments.

    Plots    Treatments
             1        2        3        4
      1      26.3     18.5     36.4     39.8
      2      36.0     21.1     21.8     28.7
      3      54.2     29.3     24.0     21.2
      4      25.7     17.2     18.5     3?.4
      5      52.4     12.4     10.2     29.0

[Ans. Mean square between treatments = 348.29.
Mean square within treatments = 100.33.
F = 3.48.
The 5 per cent value of F for ν₁ = 3 and ν₂ = 16 is 3.24, which is less than the calculated value. Hence there is statistical evidence for genuine treatment effects.]

5. The weights in gm. of a number of copper wires, each of 
length 1 meter, were obtained. These are shown below 
classified according to the die from which they come:— 




Die No. 



/ 

II 

Ill 

IV 

V 

1*30 

1*28 

1 *32 

1*31 

1*30 

1*32 

1*35 

1-29 

1*29 

1*32 

1-36 

1*33 

1*31 

1*33 

1*30 

1*35 

1*34 

1*28 

1*31 

1*33 

1*32 


1*33 

1*32 


1*37 


1*30 




Test the hypothesis that there is no difference between the mean weights of wires coming from the different dies.

[Ans. Mean square between dies = 8.99.
Mean square within dies = 4.97.
F = 1.81.
The 5 per cent value of F for ν₁ = 4 and ν₂ = 20 is 2.87, which is greater than the calculated value. Hence the hypothesis may be accepted.]

18.5. Two criteria of classification. We now proceed to the case where there are two criteria of classification. We classify our sample of N values of x according to some quality A into k classes and, according to another quality B, into n classes, so that N = nk. Let the sample variate value in the ith A-class and jth B-class be $x_{ij}$. Let $\bar{x}_j$ be the mean of the jth row and $\bar{x}_i$ the mean of the ith column. We then have the following identity:

$$\sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij}-\bar{x})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n}\{(x_{ij}-\bar{x}_i-\bar{x}_j+\bar{x}) + (\bar{x}_i-\bar{x}) + (\bar{x}_j-\bar{x})\}^2$$
$$= \sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij}-\bar{x}_i-\bar{x}_j+\bar{x})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n}(\bar{x}_i-\bar{x})^2 + \sum_{j=1}^{n}\sum_{i=1}^{k}(\bar{x}_j-\bar{x})^2$$
$$= \sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij}-\bar{x}_i-\bar{x}_j+\bar{x})^2 + \sum_{i=1}^{k} n(\bar{x}_i-\bar{x})^2 + \sum_{j=1}^{n} k(\bar{x}_j-\bar{x})^2. \qquad \ldots(1)$$

The product terms in the expansion vanish as in the case of one criterion of classification.

For example, we have

$$\sum_{i=1}^{k}\sum_{j=1}^{n}(\bar{x}_i-\bar{x})(x_{ij}-\bar{x}_i-\bar{x}_j+\bar{x}) = \sum_{i=1}^{k}(\bar{x}_i-\bar{x})\sum_{j=1}^{n}(x_{ij}-\bar{x}_i-\bar{x}_j+\bar{x})$$
$$= \sum_{i=1}^{k}(\bar{x}_i-\bar{x})(n\bar{x}_i - n\bar{x}_i - n\bar{x} + n\bar{x}) = 0.$$

Similarly the other cross-products can be shown to be zero. This is left as an exercise for the students.

It can be shown that if all the data are drawn from the same population with variance σ², the three terms on the right hand side of (1) are independent estimates of (k − 1)(n − 1)σ², (k − 1)σ² and (n − 1)σ² respectively. The analysis of variance table is then as follows:

Analysis of Variance for Two Criteria of Classification

    Source of            Sum of squares                                                Degrees of        Estimate of variance
    variation                                                                          freedom

    Between A-classes    $\sum_{i=1}^{k} n(\bar{x}_i-\bar{x})^2$                       k − 1             $\dfrac{\sum n(\bar{x}_i-\bar{x})^2}{k-1} = Q_A$
    Between B-classes    $\sum_{j=1}^{n} k(\bar{x}_j-\bar{x})^2$                       n − 1             $\dfrac{\sum k(\bar{x}_j-\bar{x})^2}{n-1} = Q_B$
    Residual             $\sum_i\sum_j (x_{ij}-\bar{x}_i-\bar{x}_j+\bar{x})^2$         (k − 1)(n − 1)    $\dfrac{\sum\sum (x_{ij}-\bar{x}_i-\bar{x}_j+\bar{x})^2}{(k-1)(n-1)} = Q_{AB}$
    Total                $\sum_i\sum_j (x_{ij}-\bar{x})^2$                             nk − 1            -

Remarks. (i) Test $Q_A/Q_{AB}$ for k − 1 and (k − 1)(n − 1) degrees of freedom using the F-table, and (ii) test $Q_B/Q_{AB}$ for (n − 1) and (k − 1)(n − 1) degrees of freedom using the F-table.
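The two-criteria table can be computed directly from the definitions. The sketch below is one possible arrangement, assuming the data are given as a list of rows (B-classes) whose entries run over the columns (A-classes); the function name, the dictionary keys and the small data set are hypothetical.

```python
def two_way_anova(table):
    """table[j][i] is the value in row j (B-class) and column i (A-class)."""
    n, k = len(table), len(table[0])            # n rows, k columns, N = nk
    N = n * k
    grand = sum(sum(row) for row in table) / N
    row_means = [sum(row) / k for row in table]
    col_means = [sum(table[j][i] for j in range(n)) / n for i in range(k)]
    ss_a = n * sum((m - grand) ** 2 for m in col_means)    # between A-classes
    ss_b = k * sum((m - grand) ** 2 for m in row_means)    # between B-classes
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_res = sum((table[j][i] - col_means[i] - row_means[j] + grand) ** 2
                 for j in range(n) for i in range(k))
    q_a = ss_a / (k - 1)
    q_b = ss_b / (n - 1)
    q_ab = ss_res / ((k - 1) * (n - 1))
    return {"ss_a": ss_a, "ss_b": ss_b, "ss_res": ss_res, "ss_total": ss_total,
            "F_a": q_a / q_ab, "F_b": q_b / q_ab}

# Hypothetical 4 x 3 table of observations.
out = two_way_anova([[5, 7, 9], [6, 9, 12], [4, 8, 9], [9, 10, 14]])
print(out)
```

The residual is computed here from its definition and not by subtraction, so the agreement of the three components with the total sum of squares serves as a check on identity (1).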


18.6. Solved Examples.

1. Four varieties of potato are planted, each on five plots of ground of the same size and type; and each variety is treated with five different fertilizers. The yields in tons are as follows:

    Variety    Fertilizers
               1       2       3       4       5
       1       1.9     2.2     2.6     1.8     2.1
       2       2.5     1.9     2.3     2.6     2.2
       3       1.7     1.9     2.2     2.0     2.1
       4       2.1     1.8     2.5     2.3     2.4

Perform an analysis of variance and show whether there is any significant difference between the yields of different varieties or due to different fertilizers. (Agra B. Sc. '60)

Since the variance of a set of values is independent of the origin, a shift of origin does not affect variance-calculations. Again, since we are concerned only with the ratio of two variances, any change of scale will not affect the value of this ratio. In this case we take the origin at 2 tons and the unit of measurement as 1/10 ton. In other words, we put

$$u_{ij} = 10\,(x_{ij} - 2).$$

Yields of potato after change of origin and scale.

    Variety    Fertilizers
               1      2      3      4      5
       1       −1     2      6      −2     1
       2       5      −1     3      6      2
       3       −3     −1     2      0      1
       4       1      −2     5      3      4

With the suffix j denoting the row and the suffix i denoting the column, let

$$S = \sum_i\sum_j u_{ij}, \qquad S_i = \sum_j u_{ij}, \qquad S_j = \sum_i u_{ij}.$$

We then have

$$\sum_i\sum_j (u_{ij}-\bar{u})^2 = \sum_i\sum_j u_{ij}^2 - N\bar{u}^2 = \sum_i\sum_j u_{ij}^2 - \frac{S^2}{N},$$
$$\sum_i\sum_j (\bar{u}_i-\bar{u})^2 = \sum_i n_i(\bar{u}_i-\bar{u})^2 = \sum_i \frac{S_i^2}{n_i} - \frac{S^2}{N},$$
$$\sum_i\sum_j (\bar{u}_j-\bar{u})^2 = \sum_j n_j(\bar{u}_j-\bar{u})^2 = \sum_j \frac{S_j^2}{n_j} - \frac{S^2}{N}.$$

We now draw up the following table, in which each entry is $u_{ij}$ with $u_{ij}^2$ in brackets:

    Variety    Fertilizers                                            S_j       S_j²     Σ_i u_ij²
               1         2         3         4         5
       1       −1 (1)    2 (4)     6 (36)    −2 (4)    1 (1)          6         36       46
       2       5 (25)    −1 (1)    3 (9)     6 (36)    2 (4)          15        225      75
       3       −3 (9)    −1 (1)    2 (4)     0 (0)     1 (1)          −1        1        15
       4       1 (1)     −2 (4)    5 (25)    3 (9)     4 (16)         11        121      55
      S_i      2         −2        16        7         8           S = 31       383      191
      S_i²     4         4         256       49        64             377

Hence, we have

(i) Total sum of squared deviations
$$= \sum_i\sum_j u_{ij}^2 - \frac{S^2}{N} = 191 - \frac{31 \times 31}{20} = 191 - 48.05 = 142.95$$
with 5 × 4 − 1 = 19 degrees of freedom.

(ii) Sum of squares between fertilizers
$$= \sum_i \frac{S_i^2}{n_i} - \frac{S^2}{N} = \frac{377}{4} - 48.05 = 94.25 - 48.05 = 46.2$$
with 5 − 1 = 4 degrees of freedom.

(iii) Sum of squares between varieties
$$= \sum_j \frac{S_j^2}{n_j} - \frac{S^2}{N} = \frac{383}{5} - 48.05 = 76.60 - 48.05 = 28.55$$
with 4 − 1 = 3 degrees of freedom.

(iv) Residual sum of squares
$$= 142.95 - 46.20 - 28.55 = 68.20$$
with (4 − 1)(5 − 1) = 12 degrees of freedom.

The analysis of variance table.

    Source of              Sum of      Degrees of    Estimate of            F
    variation              squares     freedom       variance

    Between fertilizers    46.2        4             46.2/4 = 11.55         11.55/5.68 = 2.03
    Between varieties      28.55       3             28.55/3 = 9.52         9.52/5.68 = 1.67
    Residual               68.20       12            68.20/12 = 5.68
    Total                  142.95      19

The 5 per cent value of F for ν₁ = 4 and ν₂ = 12 is 3.26. Since the calculated value, i.e. 2.03, is less than this value, the difference between fertilizers is not significant.

Again the 5 per cent value of F for ν₁ = 3 and ν₂ = 12 is 3.49, which is much greater than 1.67. Hence the difference between varieties is also not significant. It follows that the data are homogeneous. In other words, the data can be supposed to be obtained from a population in which there was no significant difference between the yields of varieties and the fertilizers did not differ in their effect.
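That the change of origin and scale leaves the variance ratios untouched can be confirmed by running the same calculation on the coded and on the original yields. The sketch below follows the short-cut formulas used in this example; the function name is an illustrative assumption.

```python
yields = [
    [1.9, 2.2, 2.6, 1.8, 2.1],   # variety 1, fertilizers 1 to 5
    [2.5, 1.9, 2.3, 2.6, 2.2],   # variety 2
    [1.7, 1.9, 2.2, 2.0, 2.1],   # variety 3
    [2.1, 1.8, 2.5, 2.3, 2.4],   # variety 4
]

def variance_ratios(table):
    """Return (F for columns, F for rows) using row and column totals."""
    rows, cols = len(table), len(table[0])
    N = rows * cols
    S = sum(sum(r) for r in table)
    correction = S * S / N
    total = sum(x * x for r in table for x in r) - correction
    between_rows = sum(sum(r) ** 2 for r in table) / cols - correction
    between_cols = sum(sum(table[j][i] for j in range(rows)) ** 2
                       for i in range(cols)) / rows - correction
    residual = total - between_rows - between_cols
    q_res = residual / ((rows - 1) * (cols - 1))
    return (between_cols / (cols - 1)) / q_res, (between_rows / (rows - 1)) / q_res

# Coded values u = 10 (x - 2); round() removes floating-point noise.
coded = [[round(10 * (x - 2)) for x in row] for row in yields]
f_fert_coded, f_var_coded = variance_ratios(coded)
f_fert_raw, f_var_raw = variance_ratios(yields)
print(round(f_fert_coded, 2), round(f_var_coded, 2))   # 2.03 1.67
```

Both sets of ratios agree to within rounding error, which is why the coding may be chosen purely for arithmetical convenience.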

2. In an experiment on the effects of temperature conditions on human performance, 8 practised subjects were given a sensori-motor test in each of 4 temperature conditions. Since the subjects were all practised, the order in which the tests were done was unimportant. The tests were randomised amongst the subjects, so that for each condition there were equal numbers of first testing, second testing, third testing and fourth testing. The scores in the tests are shown in the following table:

    Temperatures (B)    Subjects (A)
                        1      2      3      4      5      6      7      8
           1            76     80     79     90     85     101    94     83
           2            75     81     77     90     86     98     93     85
           3            76     78     76     91     82     98     92     83
           4            68     75     72     85     82     90     82     77

Perform an analysis of variance and show whether there is any significant difference between the scores of subjects or due to temperature conditions.

We take the working mean at x = 84, i.e. we put

$$u_{ij} = (x_{ij} - 84).$$

Then the calculations of variances are shown below using the notation of the previous example. Each entry is $u_{ij}$ with $u_{ij}^2$ in brackets.

    Temperatures   Subjects (A)                                                                             S_j      S_j²     Σ_i u_ij²
    (B)            1          2         3          4         5        6          7          8
       1           −8 (64)    −4 (16)   −5 (25)    6 (36)    1 (1)    17 (289)   10 (100)   −1 (1)          16       256      532
       2           −9 (81)    −3 (9)    −7 (49)    6 (36)    2 (4)    14 (196)   9 (81)     1 (1)           13       169      457
       3           −8 (64)    −6 (36)   −8 (64)    7 (49)    −2 (4)   14 (196)   8 (64)     −1 (1)          4        16       478
       4           −16 (256)  −9 (81)   −12 (144)  1 (1)     −2 (4)   6 (36)     −2 (4)     −7 (49)         −41      1681     575
      S_i          −41        −22       −32        20        −1       51         25         −8           S = −8      2122
      S_i²         1681       484       1024       400       1        2601       625        64              6880
      Σ_j u_ij²    465        142       282        122       13       717        249        52                                2042

(i) The total sum of squares
$$= \sum_i\sum_j u_{ij}^2 - \frac{S^2}{N} = 2042 - \frac{64}{32} = 2042 - 2 = 2040.$$

(ii) Sum of squares between subjects
$$= \sum_i \frac{S_i^2}{n_i} - \frac{S^2}{N} = \frac{6880}{4} - 2 = 1718.$$

(iii) Sum of squares between temperatures
$$= \sum_j \frac{S_j^2}{n_j} - \frac{S^2}{N} = \frac{2122}{8} - 2 = 263.25.$$

(iv) Residual sum of squares = 2040 − 1718 − 263.25 = 58.75.

The analysis of variance table.

    Source of                Sum of      Degrees of    Estimate of              F
    variation                squares     freedom       variance

    Between subjects         1718        7             1718/7 = 245.4           245.4/2.8 = 87.64
    Between temperatures     263.25      3             263.25/3 = 87.75         87.75/2.8 = 31.34
    Residual                 58.75       21            58.75/21 = 2.8
    Total                    2040        31

The 5% value of F for ν₁ = 7 and ν₂ = 21 is 2.50 and for ν₁ = 3 and ν₂ = 21 the value of F is 3.07. Since both the calculated values are much greater than the corresponding tabular values, we reject the hypothesis that the data are homogeneous. The variance between subjects is very large, which shows that there are highly significant differences in ability between the subjects. The effect of temperature conditions is also quite significant.

Exercises.

1. Five doctors each test five treatments for a certain disease and observe the number of days each patient takes to recover. The results are as follows (recovery time in days):

    Doctor    Treatment
              1    2    3    4    5

Discuss the difference between:

(a) doctors, and (b) treatments.

[Ans. S. S. between treatments = 406.64; d. f. = 4.
S. S. between doctors = 25.84; d. f. = 4.
Residual = 34.56; d. f. = 16.
F = 47.00 for treatments
and F = 2.99 for doctors.

The 5% and 1% values of F for ν₁ = 4, ν₂ = 16 are 3.01 and 4.77 respectively.]

Hence the difference between treatments is highly significant whereas the difference between doctors is hardly significant.

2. The determination of visual acuity at three different distances (say A, B and C) was the subject of a recent experiment. Four different subjects chosen at random from a large group were used for this purpose. The data recorded were as follows:

    Subject    Distance
               A      B      C
       1       12     16     30
       2       5      10     18
       3       7      28     35
       4       10     26     51

Carry out an analysis of variance on these data. Test for the effects of subjects and distances at the 5% level of significance.

Ans. F = 3.55 for subjects,
F = 12.93 for distances.

The 5% value of F for ν₁ = 2, ν₂ = 6 is 5.14 and for ν₁ = 3, ν₂ = 6 this value of F is 4.76. Hence the difference between distances is highly significant whereas the difference between subjects is not significant.

3. Four experimenters determine the moisture content of samples of a powder, each man taking a sample from each of six consignments. Their assessments are:


Observer 

' 1 

2 

Consignment 

3 4 

5 

6 

1 

9 


9 


11 

11 

2 

12 

11 

9 

11 



3 

11 


10 

12 

11 


4 

12 

13 

11 

14 

12 

10 


Perform an analysis of variance on these data and discuss 
whether there is any significant difference between consign¬ 
ments or between observers. 


Ans.
                            S.S.      d.f.    Estimate of    F
                                              variance
    Between consignments    9.71      5       1.94           2.23
    Between observers       13.13     3       4.38           5.03
    Residual                13.12     15      0.87           -

The 5% value of F for ν₁ = 5, ν₂ = 15 is 2.90 and for ν₁ = 3, ν₂ = 15 this value of F is 3.29.

Hence the difference between observers is significant whereas the difference between consignments is not significant.

4 . On a feeding experiment a farmer has four types of hogs 
denoted by I, II, III, IV. These types are each divided into 
three groups which are fed varietal rations A, B and C. 
The following results are obtained, the numbers in the table 
being the gains in weight in pounds in the various groups. 



I 

II 

III 

IV 

A 

7*0 

16*0 

10*5 

13*5 

B 

14*0 

15*5 

15*0 

21*0 

C 

8*5 

16*5 

9*5 

13*5 








472 


STATISTICS 


Perform an analysis of variance on these data and test the 
significance of variation between rations and between types. 


Ans.
              S.S.      d.f.   Estimate of variance     F
Rations      54.1250      2          27.06            5.76
Types        87.7292      3          29.24            6.22
Residual     28.2083      6           4.70              -

The 5% and 1% values of F for $\nu_1 = 2$, $\nu_2 = 6$ are 5.14 and
10.92 respectively and for $\nu_1 = 3$, $\nu_2 = 6$, these values of F are
4.76 and 9.78 respectively. Hence there is a significant
difference between types and between varieties of rations
at the 5% point, but neither is significant at the 1% point.

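18.7. Three criteria of classification. Let us suppose that
the sample values of a given normal variate are classified according
to three criteria A, B, C into l groups of m rows and n
columns so that N = lmn, where N is the total number of sample
values. Let $x_{ijk}$ denote the value of the variate in the jth row
and kth column of the ith group, where i = 1, 2, 3, ..., l;
j = 1, 2, 3, ..., m and k = 1, 2, 3, ..., n.

In this case, we use the identity :

$$
\begin{aligned}
\sum_{i=1}^{l}\sum_{j=1}^{m}\sum_{k=1}^{n}(x_{ijk}-\bar{x})^2
={}& mn\sum_{i=1}^{l}(\bar{x}_i-\bar{x})^2 + nl\sum_{j=1}^{m}(\bar{x}_j-\bar{x})^2 + lm\sum_{k=1}^{n}(\bar{x}_k-\bar{x})^2 \\
&+ l\sum_{j=1}^{m}\sum_{k=1}^{n}(\bar{x}_{jk}-\bar{x}_j-\bar{x}_k+\bar{x})^2 + m\sum_{i=1}^{l}\sum_{k=1}^{n}(\bar{x}_{ik}-\bar{x}_i-\bar{x}_k+\bar{x})^2 \\
&+ n\sum_{i=1}^{l}\sum_{j=1}^{m}(\bar{x}_{ij}-\bar{x}_i-\bar{x}_j+\bar{x})^2 \\
&+ \sum_{i=1}^{l}\sum_{j=1}^{m}\sum_{k=1}^{n}(x_{ijk}-\bar{x}_{jk}-\bar{x}_{ik}-\bar{x}_{ij}+\bar{x}_i+\bar{x}_j+\bar{x}_k-\bar{x})^2, \qquad \ldots(1)
\end{aligned}
$$

where $\bar{x}$ = overall mean;

$\bar{x}_{ij} = \frac{1}{n}\sum_{k=1}^{n} x_{ijk}$ = mean of values in ith group and jth row,
with similar meanings to $\bar{x}_{jk}$ and $\bar{x}_{ik}$; and

$\bar{x}_i = \frac{1}{mn}\sum_{j=1}^{m}\sum_{k=1}^{n} x_{ijk}$ = mean of the ith group,
with similar meanings to $\bar{x}_j$ and $\bar{x}_k$.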





First three terms on the right hand side of identity (1) give
the variations between groups, between rows and between
columns respectively. These are known as the main effects due
to the factors A, B, C respectively. The next three terms are
known as interactions between B and C, between C and A, and
between A and B respectively. These interactions involve two main
effects and are called first-order interactions. Finally, the seventh
term on the right hand side of (1) is the residual. As a matter
of fact, this term is the interaction of all three main effects and
is known as a second-order interaction. The number of degrees
of freedom involved in an interaction is always the product of
the d. f. of the component main effects. Thus the d. f. for
interaction BC will be (m - 1)(n - 1), for AC (n - 1)(l - 1) and
for AB (l - 1)(m - 1), whereas for the second-order interaction
ABC the d. f. will be (l - 1)(m - 1)(n - 1). The total d. f. are
(lmn - 1) and it is easy to verify that

(lmn - 1) = (l - 1) + (m - 1) + (n - 1) + (m - 1)(n - 1) + (n - 1)(l - 1)
            + (l - 1)(m - 1) + (l - 1)(m - 1)(n - 1),

so that the degrees of freedom are additive.
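The splitting of the total sum of squares into seven parts, and the additivity of the degrees of freedom, can be checked numerically. The sketch below is written in Python and is not part of the original treatment; the function names and the randomly generated array `x` (with l = 4, m = 5, n = 2) are assumptions chosen only for illustration.

```python
import random
from itertools import product


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def three_way_terms(x, l, m, n):
    """Return the seven right-hand terms of the identity and the total sum of squares."""
    idx = list(product(range(l), range(m), range(n)))
    g = mean(x[i][j][k] for i, j, k in idx)                                   # overall mean
    xi = [mean(x[i][j][k] for j in range(m) for k in range(n)) for i in range(l)]
    xj = [mean(x[i][j][k] for i in range(l) for k in range(n)) for j in range(m)]
    xk = [mean(x[i][j][k] for i in range(l) for j in range(m)) for k in range(n)]
    xij = [[mean(x[i][j][k] for k in range(n)) for j in range(m)] for i in range(l)]
    xik = [[mean(x[i][j][k] for j in range(m)) for k in range(n)] for i in range(l)]
    xjk = [[mean(x[i][j][k] for i in range(l)) for k in range(n)] for j in range(m)]
    total = sum((x[i][j][k] - g) ** 2 for i, j, k in idx)
    terms = [
        m * n * sum((a - g) ** 2 for a in xi),                                # main effect A
        n * l * sum((b - g) ** 2 for b in xj),                                # main effect B
        l * m * sum((c - g) ** 2 for c in xk),                                # main effect C
        l * sum((xjk[j][k] - xj[j] - xk[k] + g) ** 2
                for j in range(m) for k in range(n)),                         # interaction BC
        m * sum((xik[i][k] - xi[i] - xk[k] + g) ** 2
                for i in range(l) for k in range(n)),                         # interaction CA
        n * sum((xij[i][j] - xi[i] - xj[j] + g) ** 2
                for i in range(l) for j in range(m)),                         # interaction AB
        sum((x[i][j][k] - xjk[j][k] - xik[i][k] - xij[i][j]
             + xi[i] + xj[j] + xk[k] - g) ** 2 for i, j, k in idx),           # residual (ABC)
    ]
    return terms, total


random.seed(0)
l, m, n = 4, 5, 2            # hypothetical numbers of groups, rows and columns
x = [[[random.randint(-10, 10) for _ in range(n)] for _ in range(m)] for _ in range(l)]
terms, total = three_way_terms(x, l, m, n)
dfs = [l - 1, m - 1, n - 1, (m - 1) * (n - 1), (n - 1) * (l - 1),
       (l - 1) * (m - 1), (l - 1) * (m - 1) * (n - 1)]
identity_holds = abs(sum(terms) - total) < 1e-8
print(identity_holds, sum(dfs) == l * m * n - 1)
```

Any other values may be put in `x`; the identity does not depend on the particular numbers.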



Analysis of variance table for three Criteria of Classification. 












Example. Two manurial treatments T1 and T2 were applied to five varieties of a cereal at four different
stations. The yields in tons are given in the following table [Fictitious data].

[Table I: yields for each station (S1 to S4), variety (V1 to V5) and treatment (T1, T2).]

The sum of the squares of the values is 640 and the total sum of the values is -8.

Hence
$$\sum_{i,j,k}(x_{ijk}-\bar{x})^2 = 640 - \frac{(-8)^2}{40} = 640 - 1.6 = 638.4 \qquad \ldots(A)$$

Main effects.

(1) The totals for stations are -40, -22, 13 and 41, each
being the sum of 5 x 2 i.e. 10 readings.

Hence
$$10\sum_{i}(\bar{x}_i-\bar{x})^2 = \frac{(-40)^2+(-22)^2+13^2+41^2}{10} - \frac{(-8)^2}{40} = 393.4 - 1.6 = 391.8 \qquad \ldots(B)$$

(2) The five totals for varieties are 1, -6, -19, -4 and
20, each being the sum of 4 x 2 = 8 readings.

Hence
$$8\sum_{j}(\bar{x}_j-\bar{x})^2 = \frac{1^2+(-6)^2+(-19)^2+(-4)^2+20^2}{8} - 1.6 = \frac{814}{8} - 1.6 = 100.15 \qquad \ldots(C)$$

(3) The totals for treatments are obtained by adding
together the five totals for T1 and T2.

Thus Total T1 = -2 - 8 - 11 - 4 + 8 = -17
     Total T2 = 3 + 2 - 8 + 0 + 12 = 9.

Clearly each of these totals is the sum of 5 x 4 = 20 readings.

Hence
$$20\sum_{k}(\bar{x}_k-\bar{x})^2 = \frac{(-17)^2+9^2}{20} - \frac{(-8)^2}{40} = \frac{370}{20} - 1.6 = 18.5 - 1.6 = 16.9 \qquad \ldots(D)$$
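The three main-effect sums of squares follow one pattern: square the class totals, divide by the number of readings in each total, and subtract the correction term. A minimal Python sketch of that pattern is given below; the helper name `ss_from_totals` is an assumption for illustration, while the totals are those quoted in the text.

```python
def ss_from_totals(totals, readings_per_total, grand_total, n_values):
    """Sum of squares for a main effect, computed from its class totals."""
    correction = grand_total ** 2 / n_values
    return sum(t * t for t in totals) / readings_per_total - correction


N, GRAND = 40, -8            # number of readings and their total sum

ss_stations = ss_from_totals([-40, -22, 13, 41], 10, GRAND, N)
ss_varieties = ss_from_totals([1, -6, -19, -4, 20], 8, GRAND, N)
ss_treatments = ss_from_totals([-17, 9], 20, GRAND, N)
print(round(ss_stations, 2), round(ss_varieties, 2), round(ss_treatments, 2))
```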

First order interactions. These interactions are best obtained
by first forming three two-way tables from table I.

(4) The table for interaction between stations and varieties
is as shown in the following table.

Table II

            V1     V2     V3     V4     V5    Totals
S1         -10     -6    -17     -8      1     -40
S2          -3     -6     -7     -5     -1     -22
S3           5      1     -4      5      6      13
S4           9      5      9      4     14      41
Totals       1     -6    -19     -4     20      -8

The sum of the squares of the values in the main body of the
table will be found to be 1112. Here each entry is the sum
of two values, and with an obvious extension of previous results
we have

$$2\sum_{i,j}(\bar{x}_{ij}-\bar{x})^2 = \frac{1112}{2} - 1.6 = 554.4.$$

Now for the interaction SV, we have

$$2\sum_{i,j}(\bar{x}_{ij}-\bar{x}_i-\bar{x}_j+\bar{x})^2 = 2\sum_{i,j}(\bar{x}_{ij}-\bar{x})^2 - 10\sum_{i}(\bar{x}_i-\bar{x})^2 - 8\sum_{j}(\bar{x}_j-\bar{x})^2 = 554.4 - 391.8 - 100.15 = 62.45.$$

Thus interaction sum of squares for SV = 62.45. ...(E)

(5) To find the interaction VT between varieties and treatments,
we have the following table :

Table III

            V1     V2     V3     V4     V5    Totals
T1          -2     -8    -11     -4      8     -17
T2           3      2     -8      0     12       9
Totals       1     -6    -19     -4     20      -8

The sum of the squares of the values in the main body of the
table will be found to be 490. Here each entry is the sum of four
values. Hence we have

$$4\sum_{j,k}(\bar{x}_{jk}-\bar{x})^2 = \frac{490}{4} - 1.6 = 120.9.$$

Now for the interaction VT, we have

$$4\sum_{j,k}(\bar{x}_{jk}-\bar{x}_j-\bar{x}_k+\bar{x})^2 = 120.9 - 100.15 - 16.9 = 3.85 \qquad \ldots(F)$$

(6) To find the interaction ST between stations and
treatments, we have the following table :

Table IV

            T1     T2    Total
S1         -22    -18     -40
S2         -16     -6     -22
S3           4      9      13
S4          17     24      41
Total      -17      9      -8

This table is found by adding the five entries in table I
for each station and treatment, e.g. -6 - 4 - 10 - 3 + 1 = -22 for S1 and T1, and so on.

Now the sum of the squares of the values in the main body
of the table will be found to be 2062. Here each entry is the
sum of five values so that

$$5\sum_{i,k}(\bar{x}_{ik}-\bar{x})^2 = \frac{2062}{5} - 1.6 = 412.4 - 1.6 = 410.8.$$

Hence for the interaction ST, we have

$$5\sum_{i,k}(\bar{x}_{ik}-\bar{x}_i-\bar{x}_k+\bar{x})^2 = 5\sum_{i,k}(\bar{x}_{ik}-\bar{x})^2 - 10\sum_{i}(\bar{x}_i-\bar{x})^2 - 20\sum_{k}(\bar{x}_k-\bar{x})^2 = 410.8 - 391.8 - 16.9 = 2.10 \qquad \ldots(G)$$

(7) Finally, we find the residual term by adding (B), (C),
(D), (E), (F), (G) and subtracting the sum from (A).

Thus residual = 638.4 - 391.8 - 100.15 - 16.9 - 62.45 - 3.85 - 2.10
              = 61.15.
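The interaction sums of squares and the residual can be recomputed directly from the three two-way tables of totals. The following Python sketch assumes the tables II, III and IV and the values (A) to (D) as given in the text; the helper name `table_ss` is hypothetical.

```python
def table_ss(table, values_per_entry, grand_total=-8, n_values=40):
    """Sum of squares between the cells of a two-way table of totals."""
    correction = grand_total ** 2 / n_values
    return sum(v * v for row in table for v in row) / values_per_entry - correction


# Total (A) and main effects (B), (C), (D) as found in the text.
total_ss, ss_s, ss_v, ss_t = 638.4, 391.8, 100.15, 16.9

table_sv = [[-10, -6, -17, -8, 1], [-3, -6, -7, -5, -1], [5, 1, -4, 5, 6], [9, 5, 9, 4, 14]]
table_vt = [[-2, -8, -11, -4, 8], [3, 2, -8, 0, 12]]
table_st = [[-22, -18], [-16, -6], [4, 9], [17, 24]]

# Each interaction is the between-cells sum of squares less the two main effects.
ss_sv = table_ss(table_sv, 2) - ss_s - ss_v
ss_vt = table_ss(table_vt, 4) - ss_v - ss_t
ss_st = table_ss(table_st, 5) - ss_s - ss_t
residual = total_ss - (ss_s + ss_v + ss_t + ss_sv + ss_vt + ss_st)
print(round(ss_sv, 2), round(ss_vt, 2), round(ss_st, 2), round(residual, 2))
```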


Analysis of variance table

Source of variation      Sum of squares   Degrees of freedom   Estimate of variance      F
Between stations             391.80               3                  130.60            25.61
Between varieties            100.15               4                   25.04             4.91
Between treatments            16.90               1                   16.90             3.31
Interaction (S x V)           62.45              12                    5.20             1.02
Interaction (V x T)            3.85               4                    0.96             0.19
Interaction (S x T)            2.10               3                    0.70             0.14
Residual                      61.15              12                    5.10               -
Total                        638.40              39                      -                -

















Interpretation of analysis. The most striking feature of this
analysis is the very large ratio for stations, which implies that the
difference between stations is highly significant. (Students should
find the 5% value of F for $\nu_1 = 3$ and $\nu_2 = 12$.) The differences
between varieties are not significant at the 1% level but are significant
at the 5% level, as may be verified by the students from F-tables for
$\nu_1 = 4$ and $\nu_2 = 12$. The difference between treatments is not
significant. It follows that the variation in yields is due to variation
between stations and (perhaps) between varieties, but cannot be
attributed to real differential effects between treatments without
further inquiry.

Again it can be verified that the three first order interactions
are not significant at the 5 per cent level. This means that we may
assume that there is no “entanglement” between the factors and
that there is support for the hypothesis that the three are affecting
yields independently.



Exercises 

1. In a certain psychological experiment the apparatus consisted
of a dial having a rotating needle and a number of graduations
round the circumference. The needle could be rotated
at 3 different speeds and the dial displayed under three
different intensities of illumination. Subjects in this
experiment were required to make a certain reaction each
time the needle reached a graduation on the dial. Since
there are nine combinations of speed and illumination, each
subject had 9 tasks to perform. The following table gives the
number of correct reactions made in each of the 9 tasks by 6
different subjects.

Examine the relative effects of speed and intensity of illumination
on performance.

                                Illuminations (A)
                      1                  2                  3
                 Speeds (B)         Speeds (B)         Speeds (B)
Subjects (C)    1    2    3        1    2    3        1    2    3
     1         45   38   29       43   35   26       35   29   18
     2         38   33   20       40   32   21       34   25   19
     3         39   32   21       41   29   25       29   24   16
     4         43   37   24       39   30   25       30   27   14
     5         40   36   28       42   31   24       31   26   17
     6         40   35   25       40   32   22       32   26   16






[Ans. Analysis of variance table is as follows :

Variation                     S. O. S.    D. F.   Estimate of variance      F
Between illuminations (A)      765.59       2           382.80            150.7
Between speeds (B)            2369.37       2          1184.68            466.3
Between subjects (C)           118.15       5            23.63              9.3
Interaction AB                  30.52       4             7.63              3.0
Interaction AC                  53.07      10             5.31              2.09
Interaction BC                  13.30      10             1.33              0.52
Residual                        50.81      20               -                 -
Total                         3400.82      53               -                 -

Consulting tables, we find that variations between illuminations
and between speeds are highly significant both at the
5 per cent and the 1 per cent level. Differences between subjects
are also significant. The interaction AB is just significant at
the 5 per cent level but interactions AC and BC are not significant.
The experiment shows that both speed and illumination
have a very marked effect on performance, especially
speed. The data depart significantly from homogeneity.]

2. The following table gives porosity readings on 3 lots of condenser
paper. There are 3 readings on each of 9 rolls from
each lot. Perform an analysis of variance on the data to find
out whether there are significant variations (i) among readings
within rolls, (ii) among rolls within lots, and (iii) among lots.

Porosity Readings on Condenser Paper

Lot      Reading                        Roll number
number   number      1     2     3     4     5     6     7     8     9
           1        1.5   1.5   2.7   3.0   3.4   2.1   2.0   3.0   5.1
  I        2        1.7   1.6   1.9   2.4   5.6   4.1   2.5   2.0   5.0
           3        1.6   1.7   2.0   2.6   5.6   4.6   2.8   1.9   4.0

           1        1.9   2.3   1.8   1.9   2.0   3.0   2.4   1.7   2.6
  II       2        1.5   2.4   2.9   3.5   1.9   2.6   2.0   1.5   4.3
           3        2.1   2.4   4.7   2.8   2.1   3.5   2.1   2.0   2.4

           1        2.5   3.2   1.4   7.8   3.2   1.9   2.0   1.1   2.1
  III      2        2.9   5.5   1.5   5.2   2.5   2.2   2.4   1.4   2.5
           3        3.3   7.1   3.4   5.0   4.0   3.1   3.7   4.1   1.9


[Ans. Analysis of variance table is as follows :

Source of variation               Sum of squares   D. F.   Estimate of variance     F
Between rolls                         26.31           8           3.29            3.14
Between lots                           7.90           2           3.95            3.81
Between readings                       5.73           2           2.87            2.76
Interaction (rolls x lots)            66.56          16           4.16            4.00
Interaction (lots x readings)          3.27           4           0.82
Interaction (readings x rolls)        10.19          16           0.66
Residual                              33.10          32           1.04
Total                                153.06          80

Obviously the interactions lots x readings and readings x rolls
are not significant. But interaction rolls x lots is significant at
1 per cent level. Variation between rolls is also significant at
1 per cent level whereas that between lots is significant at the
5 per cent level.]






CHAPTER XIX 


INDEX NUMBERS 

19.1. Introduction. An index number is a number which
expresses the relative change in the magnitude of a variable or
of a number of variables during a specified period. The variable may
be the price of a certain commodity, quantitative production
of certain goods or the cost of living. It is a statistical device
to measure the level of a certain phenomenon in comparison
with a certain period known as base period, which may be a
week, month, year or group of years. The index numbers
represent the continuous upward or downward changes in the
value of the variables and are known as economic barometers or
economic indicators, since they help in understanding the changes
in economic conditions of the society. According to Edgeworth,
"Index numbers are numbers which show by their variation
the change in magnitude which is not susceptible either of
accurate measurement in itself or of direct valuation in practice."

19.2. Uses. Index numbers of prices play a vital role
in measuring the general economic conditions of the society
and in fixing the wages as well as in setting up of new industries.
It is well known that wages in 1963 are higher than those in 1939
but it cannot be said that the people have become more prosperous,
since the price level of all commodities has also gone up
considerably. It is for this purpose that the dearness allowance
in certain industries is linked up with the cost of living
index number. A businessman, before selecting an industry
he wishes to start, may like to know the trend of change in
prices, wages and incomes in different industries so that he
may be able to have a general idea of the comparative courses
which the future holds for different undertakings. However,
the index numbers have their own limitations as errors
are likely to creep in when selecting the base year, collecting
information and selecting commodities.

19.3. Construction of Index Numbers. The following are
the requirements for the construction of index numbers :

(1) The purpose for which the index number is required.

(2) Selection of items to be included.

(3) Sources of information.

(4) Choice of the base period.

(5) Choice of the average.

(6) Determination of weights.

1. Purpose. It is necessary to be clear about the purpose 
for which the index number is required. The selection of items 
also depends upon the purpose of the index number. Thus for 
finding the index number of the cost of living in Bengal, we 
need not consider items like wheat, bajra etc. but will have to 
give more weight to rice, fish and such commodities as are 
generally used there. 

2. Selection of Items. As stated above, the selection of 
commodities differs with the purpose for which the index number 
is required. The items selected should be representative of 
the purpose of the index numbers. Thus if the index number 
for the cost of building materials is required, the following 
commodities should be selected : (i) bricks, (ii) cement, 
(iii) labour, (iv) miscellaneous items like wood, paints etc. 
Similarly for finding the index number of cost of living, the 
commodities are : (i) food, (ii) clothes, (iii) house-rent (iv) fuel 
(v) miscellaneous items like general merchandise, books, school 
fees etc. The number of items selected should neither be too 
small nor too large, since if the number of items is small, the 
index number will not be representative and if it is too large, 
the computation will be tedious and expensive.

3. Sources of information. The information required for
the construction of index numbers should be reliable.
Thus for finding the prices, either standard journals should be
used or the prices may be obtained through reliable agents in the market.
Moreover, the wholesale prices should be taken into account
since the retail prices are generally sluggish and vary from shop
to shop.

4. Choice of the base period. The base period should be
such that the conditions may be stable and normal during it.
There should not have been sudden rises or falls in the base
period. For the cost of living index, 1913, 1926, 1949 and 1952-53
have been taken as the base years from time to time. Moreover,
the base period should not be small and so a day, week
or month are unsuitable for this purpose.





5. Averages. Any of the averages, arithmetic mean, 
median, mode, geometric mean, harmonic mean may be used, 
but generally the arithmetic and geometric mean are used. The 
arithmetic mean gives too much weight to large items and is not 
reversible. The geometric mean on the other hand gives too
much weight to small items and is reversible. It is suitable 
for measuring the average ratio of change in prices. For 
instance if the price of a commodity is doubled and the other is 
halved, the two ratios balance each other. 

6. Weighting. In cases where the relative importance of 
all items is not equal, the weighted average is used for finding 
the index number. Weighting may be implicit e. g. the 
number of varieties of important commodities considered for 
construction of index numbers varies according to their relative 
importance. In explicit weighting, the weight given to each 
commodity is multiplied by its relative price. 

19.4. Fixed and Chain Bases. In fixed base, a particular
year is chosen as base year and index numbers are expressed as
relatives of that year. In chain base, the year preceding the
current year is taken as the base year. Thus for 1956, the year
1955 is taken as the base year and for 1957, the year 1956 as
the base year, and so on. The advantage in the chain base is
that new items can be added and obsolete ones of the previous
year deleted.

Percentage or Price Relative of a commodity of a year :

$$P_1 = \frac{\text{Price of the commodity for the current year}}{\text{Price of the commodity for the base year}} \times 100 \qquad \ldots(1)$$

The average of percentage relatives of all the commodities
for a year is the Index Number for that year.

19.5. Average. We may use the arithmetic mean, the
median or the geometric mean for calculating the index numbers.
Although the arithmetic mean is easy to calculate, the most
appropriate average in this case is the geometric mean. Thus if
the price of a commodity is doubled while that of the other is
halved, the geometric mean does not show any change in the
price index, which should be expected, but the arithmetic mean
shows an increase of 25 per cent. The index numbers calculated
with the help of geometric mean also satisfy the time reversal
test while those with arithmetic mean do not satisfy this test.
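The doubled-and-halved illustration can be worked out in a few lines. The sketch below, in Python, assumes two commodities with price relatives 200 and 50; the function names are chosen only for this illustration.

```python
import math


def arithmetic_index(relatives):
    """Simple arithmetic mean of price relatives."""
    return sum(relatives) / len(relatives)


def geometric_index(relatives):
    """Geometric mean of price relatives, computed through logarithms."""
    return math.exp(sum(math.log(r) for r in relatives) / len(relatives))


forward = [200.0, 50.0]      # one price doubled, the other halved
backward = [50.0, 200.0]     # the same change with the two years interchanged

am = arithmetic_index(forward)
gm = geometric_index(forward)

# Time reversal: forward index times backward index, both taken as proportions.
am_product = (arithmetic_index(forward) / 100) * (arithmetic_index(backward) / 100)
gm_product = (geometric_index(forward) / 100) * (geometric_index(backward) / 100)
print(am, round(gm, 6), am_product, round(gm_product, 6))
```

The arithmetic mean gives 125 and a product of 1.5625, while the geometric mean gives 100 and a product of 1.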




19.6. Index Numbers Based on Arithmetic Mean.

(i) Simple or Unweighted Arithmetic Mean of Price Relatives :

$$I = \frac{\sum P_1}{N}, \qquad \ldots(2)$$

where $P_1$ is the percentage or Price Relative of a commodity and
N is the number of commodities.

(ii) Weighted Arithmetic Mean of the Price Relatives :

$$I = \frac{\sum_{r=1}^{N} W_r P_r}{\sum_{r=1}^{N} W_r}, \qquad \ldots(3)$$

where $P_r$ is the price relative of the rth commodity, $W_r$ the
weight given to it and N the number of commodities.

In case the weights are the money values of the quantities
produced or consumed in the base year at base year prices, the
index number

$$I = \frac{\sum_{r=1}^{N} P_r \, p_0^{(r)} q_0^{(r)}}{\sum_{r=1}^{N} p_0^{(r)} q_0^{(r)}}, \qquad \ldots(4)$$

where $p_0^{(r)}$ is the price of the rth commodity and $q_0^{(r)}$ the quantity
produced or consumed of the rth commodity in the base
year. But

$$P_r = \frac{p^{(r)}}{p_0^{(r)}} \times 100, \qquad \ldots(5)$$

where $p^{(r)}$ is the price of the rth commodity in the current year.
Hence from (4) and (5),

$$I = \frac{\sum_{r=1}^{N} p^{(r)} q_0^{(r)}}{\sum_{r=1}^{N} p_0^{(r)} q_0^{(r)}} \times 100.$$

Similarly, if the money values of total quantities produced
or consumed in the current year at base year prices are used as
weights, we have

$$I = \frac{\sum_{r=1}^{N} p^{(r)} q^{(r)}}{\sum_{r=1}^{N} p_0^{(r)} q^{(r)}} \times 100,$$

where $q^{(r)}$ denotes the quantity consumed or produced in the
current year.

Similarly if the geometric mean is used, the corresponding
formulae can be written.
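That a weighted mean of price relatives reduces to a ratio of aggregates can be seen numerically. The Python sketch below uses illustrative prices and quantities for three commodities (assumed figures, not taken from any official series), and the function name is hypothetical.

```python
def weighted_mean_of_relatives(p0, p1, weights):
    """Weighted arithmetic mean of the price relatives 100 * p1 / p0."""
    relatives = [100 * cur / base for base, cur in zip(p0, p1)]
    return sum(w * r for w, r in zip(weights, relatives)) / sum(weights)


# Illustrative figures: base-year and current-year prices and quantities.
p0, q0 = [5, 8, 6], [10, 6, 3]
p1, q1 = [4, 7, 5], [12, 7, 4]

base_weights = [p * q for p, q in zip(p0, q0)]       # base quantities at base prices
current_weights = [p * q for p, q in zip(p0, q1)]    # current quantities at base prices

index_base = weighted_mean_of_relatives(p0, p1, base_weights)
index_current = weighted_mean_of_relatives(p0, p1, current_weights)

# The same two indices written as ratios of aggregates.
aggregate_base = 100 * sum(p * q for p, q in zip(p1, q0)) / sum(base_weights)
aggregate_current = 100 * sum(p * q for p, q in zip(p1, q1)) / sum(current_weights)
print(round(index_base, 2), round(index_current, 2))
```

Each weighted mean agrees with the corresponding ratio of aggregates, which is why the weights drop out of the final formulae.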

19.7. Reversibility Tests. Irving Fisher developed a formula
for index number, which satisfies the following two tests :

(i) Time Reversal Test. The formula for the index number
should give the same ratio between one point of comparison and the
other, no matter which of the two is taken as the base.

Thus if the index number for the year 1963 taking 1953 as
the base year is indicated as $I_{01}$ and the index number for 1953
taking 1963 as the base year as $I_{10}$, then the equation

$$I_{01} \times I_{10} = 1$$

should be satisfied.

(ii) Factor Reversal Test. Just as each formula should permit
interchange of two times without giving inconsistent results,
so it ought to permit interchanging the price and quantities without
giving inconsistent results i.e. the two results multiplied together
should give the true value ratio.

If we calculate the price relative, quantity relative and the
value relative of a year with regard to a base year for a number
of commodities, the product of price relative and quantity
relative should give the value relative. Thus

$$P_{01} \times Q_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0},$$

where $P_{01}$ indicates the price change for the current year over the
base year; $Q_{01}$, the quantity change for the current year over the
base year; and $p_1$, $q_1$ are the price and quantity respectively of the
commodity in the current year while $p_0$, $q_0$ indicate the price and
quantity of the commodity in the base year.


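19.8. Fisher’s Ideal Index Number. Fisher gave the following
formula for index number :

$$I_{01} = 100\sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}},$$

where $p_1$, $q_1$, $p_0$ and $q_0$ are as explained above.

The above formula satisfies the time reversal and factor
reversal tests as proved below :

Time Reversal Test.

$$I_{01} \times I_{10} = 100\sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100\sqrt{\frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}} = 100^2.$$

If instead of percentages, we had used proportions, we
would have

$$I_{01} \times I_{10} = 1.$$

Thus the time reversal test is satisfied.

Factor Reversal Test.

We have
$$P_{01} = 100\sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}, \qquad Q_{01} = 100\sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}}.$$

Hence
$$P_{01} \times Q_{01} = 100^2\,\frac{\sum p_1 q_1}{\sum p_0 q_0} = 100\,V_{01},$$

where $V_{01} = 100 \sum p_1 q_1 / \sum p_0 q_0$ is the value relative expressed as a percentage.

If we used proportions instead of percentages, we would have

$$P_{01} \times Q_{01} = V_{01}.$$

Thus the Fisher’s Ideal Formula satisfies Factor Reversal
Test as well.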


19.9. Circular Test. This is the generalisation of time
reversal test. If $P_{01}$ represents the price change of the current
year on the base year, $P_{12}$ the price change of base year on some
other base and $P_{20}$ the price change of the current year on the
second base, then the equation

$$P_{01} \times P_{12} \times P_{20} = 1$$

should be satisfied.

Fisher’s Ideal formula does not satisfy this test.


19.10. Solved Examples.

1. Prepare Index Number for 1904 on the basis of 1902 where
the following information is given :

             Article I         Article II        Article III
Year       Price   Qnty.     Price   Qnty.     Price   Qnty.
1902         5      10         8       6         6       3
1904         4      12         7       7         5       4

(Agra M. Com. ’47)

                  1902                  1904
Article    Price p0   Qnty. q0   Price p1   Qnty. q1    p1q1    p0q0    p1q0    p0q1
I              5         10         4          12        48      50      40      60
II             8          6         7           7        49      48      42      56
III            6          3         5           4        20      18      15      24
Total                                                    117     116      97     140

Applying Fisher’s formula,

$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100 = \sqrt{\frac{97}{116} \times \frac{117}{140}} \times 100 = 83.6.$$
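As a check on the arithmetic, Fisher's formula can be evaluated directly from the prices and quantities given in the question; since every price fell between 1902 and 1904, the index must come out below 100. The Python sketch below also confirms the two reversal tests on these figures. The function name `fisher_index` and its argument order are assumptions made for illustration.

```python
from math import sqrt


def dot(u, v):
    """Sum of products of corresponding items."""
    return sum(x * y for x, y in zip(u, v))


def fisher_index(pa, qa, pb, qb):
    """Fisher's ideal index of period b on period a as base, as a proportion."""
    return sqrt(dot(pb, qa) / dot(pa, qa) * dot(pb, qb) / dot(pa, qb))


p0, q0 = [5, 8, 6], [10, 6, 3]       # 1902 prices and quantities
p1, q1 = [4, 7, 5], [12, 7, 4]       # 1904 prices and quantities

p01 = fisher_index(p0, q0, p1, q1)   # price index, 1904 on 1902
p10 = fisher_index(p1, q1, p0, q0)   # price index, 1902 on 1904
q01 = fisher_index(q0, p0, q1, p1)   # quantity index: prices and quantities interchanged
v01 = dot(p1, q1) / dot(p0, q0)      # value ratio
print(round(100 * p01, 1), round(p01 * p10, 6), round(p01 * q01 - v01, 6))
```

On these figures the price index is about 83.6, the product of the forward and backward indices is 1, and the product of the price and quantity indices equals the value ratio.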

2. From the fixed base index number given below, prepare
chain base index numbers.

Year     1935   1936   1937   1938   1939   1940
Index     94     98    102     95     98    100

(Agra B. Com. ’43)

Year    Fixed base index number    Chain base index number
1935             94                100
1936             98                98/94 x 100 = 104.3
1937            102                102/98 x 100 = 104.1
1938             95                95/102 x 100 = 93.1
1939             98                98/95 x 100 = 103.2
1940            100                100/98 x 100 = 102.0
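The conversion used in this example, in which each year's figure is expressed as a percentage of the preceding year's, can be written as a short routine. This is only a sketch in Python; the function name `fixed_to_chain` is an assumption.

```python
def fixed_to_chain(fixed):
    """Chain base index numbers from a fixed base series; the first year is put at 100."""
    chain = [100.0]
    for previous, current in zip(fixed, fixed[1:]):
        chain.append(100 * current / previous)
    return chain


chain = fixed_to_chain([94, 98, 102, 95, 98, 100])
rounded = [round(value, 1) for value in chain]
print(rounded)
```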


3. The following are the group index numbers and the group
weights of an average working class family budget. Construct the
cost of living index number by assigning the given weights.

Groups               Weights    Index Number for April 1944
Food                    48               352
Fuel and lighting       10               220
Clothing                 8               230
House Rent              12               160
Miscellaneous          190                15

(I.A.S. ’50)

The required Index No.

  (48 x 352) + (10 x 220) + (8 x 230) + (12 x 160) + (190 x 15)
= -------------------------------------------------------------
                    48 + 10 + 8 + 12 + 190

= 25706 / 268

= 95.9.
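The weighted average above is easily verified. The sketch below, in Python, takes the weights and group index numbers exactly as they are used in the worked solution; the function name `weighted_index` is hypothetical.

```python
def weighted_index(group_indices, weights):
    """Weighted arithmetic mean of group index numbers."""
    return sum(w * i for w, i in zip(weights, group_indices)) / sum(weights)


# Figures as used in the worked solution above.
weights = [48, 10, 8, 12, 190]
group_indices = [352, 220, 230, 160, 15]

numerator = sum(w * i for w, i in zip(weights, group_indices))
index = weighted_index(group_indices, weights)
print(numerator, sum(weights), round(index, 1))
```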

4. From the chain base index numbers given below, prepare
fixed base index numbers :

Year     1945   1946   1947   1948   1949   1950
Index     92    102    104     98    103    101

If the base at 1945 is taken at 100, the index no. for 1946 is 102.
If the base at 1945 is taken at 92, the index no. for 1946 is

    102 x 92 / 100 = 93.8.

Similarly with 1945 as base = 92, the index number for 1947

    = 92/100 x 102/100 x 104 = 97.6.

Index number for 1948 = 92/100 x 102/100 x 104/100 x 98 = 95.6.
Index number for 1949 = 92/100 x 102/100 x 104/100 x 98/100 x 103 = 98.5.
Index number for 1950 = 92/100 x 102/100 x 104/100 x 98/100 x 103/100 x 101 = 99.5.

Hence the required index numbers are

92, 93.8, 97.6, 95.6, 98.5, 99.5.

Note. Having determined the index number for 1946 with
1945 as base, the index number for 1947 can be obtained by
multiplying the index number for 1946 by 104/100; similarly that for
1948 can be obtained by multiplying the thus calculated index
number for 1947 by 98/100, and so on.
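The rule stated in the Note, that each fixed base figure is the preceding one multiplied by the current chain index over 100, can be sketched in Python as follows; the function name `chain_to_fixed` is an assumption made for illustration.

```python
def chain_to_fixed(chain):
    """Fixed base index numbers from a chain base series.

    The first figure is kept as given; each later figure is the preceding
    fixed base figure multiplied by the current chain index over 100.
    """
    fixed = [float(chain[0])]
    for link in chain[1:]:
        fixed.append(fixed[-1] * link / 100)
    return fixed


fixed = chain_to_fixed([92, 102, 104, 98, 103, 101])
rounded = [round(value, 1) for value in fixed]
print(rounded)
```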





5. An enquiry into budgets of the middle class families in a
city in England gave the following information :

                 Food    Fuel and lighting    Clothing    Rent    Misc.
Expenses on       35%          10%               20%       15%     20%
Prices (1928)    £150          £25               £75       £30     £40
Prices (1929)    £145          £23               £65       £30     £45

What changes in cost of living figures of 1929 as compared with
that of 1928 are seen ? (Lucknow B. Com. 1944)

                       1928                          1929
Items          Prices    1928 prices     Prices    Price relatives     Wts.    Weighted
                in £     taken as 100     in £      based on 1928               relatives
Food            150         100           145           97              35       3395
Fuel, light      25         100            23           92              10        920
Clothing         75         100            65           87              20       1740
Rent             30         100            30          100              15       1500
Misc.            40         100            45          112.5            20       2250
Total                                                                  100       9805

Cost of living index No. = (sum of weighted relatives) / (sum of weights)
                         = 9805 / 100
                         = 98.05.

Thus the cost of living in 1929 is about 2 per cent lower than in 1928.




6. Construct with the help of data given below Fisher's Ideal
Index and show how it satisfies the Factor Reversal Test.

                 Estimated total produce in          Harvest price per maund
                 thousand tons in Saran district     in district Saran
                 1931-32        1932-33              1931-32        1932-33
                                                     Rs. as.        Rs. as.
Winter Rice         71             26                 3   8          3   2
Barley             107             83                 2   0          1  14
Maize               62             48                 2   9          1  12

(Patna M. A. ’42, Vikram B. A. ’60)

Base year 1931-32.

                   1931-32                    1932-33
Article        Price in     Quantity      Price in     Quantity     p0q0     p1q1     p1q0     p0q1
               annas p0        q0         annas p1        q1
Winter Rice       56           71            50           26        3976     1300     3550     1456
Barley            32          107            30           83        3424     2490     3210     2656
Maize             41           62            28           48        2542     1344     1736     1968
Total                                                               9942     5134     8496     6080

Fisher’s Ideal Formula :

$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100 = \sqrt{\frac{8496}{9942} \times \frac{5134}{6080}} \times 100 = 84.9.$$

$$Q_{01} = 100\sqrt{\frac{\sum p_0 q_1}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_1 q_0}} = 100\sqrt{\frac{6080}{9942} \times \frac{5134}{8496}} = 60.8.$$

$$V_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0} \times 100 = \frac{5134}{9942} \times 100 = 51.6.$$

Now $P_{01} \times Q_{01} = 84.9 \times 60.8 = 51.6 \times 100 = V_{01} \times 100$.

Hence the factor reversal test is satisfied.

Exercises 

1. Find the Index Number (1) by taking 1930 as the base,
(2) the average of the first three years as base, (3) 1940 as
base.

Year                      1930   1931   1932   1933   1934   1940   1941   1942   1943
Price of wheat per md.      4      5      6      7     7.5    10      9     10     11

[Ans. (1) 100, 125, 150, 175, 187, 250, 225, 250, 275.

(2) 80, 100, 120, 140, 150, 200, 180, 200, 220.

(3) 40, 50, 60, 70, 75, 100, 90, 100, 110.]

2. Prepare the index number of prices for three years with
average price as base.

                        Rate per Rupee
              Wheat         Cotton          Oil
I Year       10 seers       4 seers        3 seers
II Year       9 seers       3½ seers       3 seers
III Year      9 seers       3 seers        2½ seers

(Agra B. Com. ’58)
[Ans. 91, 98, 110]




3. Construct the cost of living index for 1944 from the following
data :

Groups                Weights    Group Index No. for 1944
Food                     47               247
Fuel and lighting         7               293
Clothing                  8               289
House Rent               13               100
Miscellaneous            14               236

(Alld. B. Com. ’44)
[Ans. 231]

4. The following are quantities Q in millions of lbs. and values
V in lakhs of rupees of exports of tea from India to all
countries excluding Pakistan. With base 1954 Jan.-March = 100,
compute the appropriate index numbers of tea
export and index numbers of export price of tea.

Year and quarter          Q         V
1954  Jan.-March         88.4     21,22
1954  April-June         49.1     13,43
1954  July-Sept.        133.3     37,60
1954  Oct.-Dec.         176.2     58,58
1955  Jan.-March         96.8     36,88

[Ans. Index numbers of tea export: 100, 55.54, 150.79, 199.32, 109.53;
index numbers of export price: 100, 113.9, 117.5, 138.5, 158.7]

5. Compute an appropriate index number for purposes of
comparison from the following data :

              Rice                 Wheat                Jowar
Year     Price   Quantity     Price   Quantity     Price   Quantity
1935       4        50          3        10          2         5
1945      10        40          8         8          4         4

Prices and quantities are stated in arbitrary units.

(I. A. S. ’56)





6. Prove using the following data that the factor reversal test is
satisfied by Fisher’s Ideal Formula for Index Number :

Commodity    Base year    Base year    Current year    Current year
               price       quantity       price          quantity
    A            6            50            10              56
    B            2           100             2             120
    C            4            60             6              60
    D           10            30            12              24
    E            8            40            12              26

(Delhi B. Com. ’53)

7. An enquiry into the budgets of the middle class families in
a city gave the following information :

                 Food     Rent    Clothing    Fuel    Others
Expenses on       30%      15%      20%        10%      25%
Prices (1947)    100/-     20/-     70/-       20/-     40/-
Prices (1948)     90/-     20/-     60/-       15/-     55/-

What are the changes in cost of living figures of 1948 as
compared with those of 1947 ?

[Ans. cost of living rises by 1% approx.]





